ashnator has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have 2 files which I need to parse based on certain features, but the files are very big (as much as 3 GB), so I am unable to use arrays or even ordinary variables to hold their contents.

The format of the 2 files with example is :-

1) Format of the first file:
>Harvard 32384743 234394583 John1 15.T
>MIT 13249304 545924582 Smith32 7.A
>Cambridge 76323823 983438434 Gold1234 17.G
2) Format of the second file:
>John1 40 34 40 40 25 40 40 40 40 17 40 40 40 20 40 40 40 20 40 40 40 30 40 40 19 40 40 40 37 40 11 40 40 35 25 40
>Smith32 40 40 44 13 40 40 40 50 40 40 40 40 50 40 40 40 16 40 6 40 40 45 40 40 40 2 40 40 40 40 29 40 40 40 6 40
>Gold1234 40 40 15 40 39 40 40 40 40 66 40 40 35 40 40 40 10 40 40 40 40 27 40 40 40 12 40 40 33 40 40 40 40 4 40 40
--------------------------- END -------------------------
Now 15.T, 7.A and 17.G are locations in the second file; e.g., 15.T means the 15th position in the John1 record of file 2. I have to apply the rule that a location's score must be >= 20. If it is, I have to write its name to the output file:

For example, 15.T means the 15th location in the John1 record of file 2. Since the value at the 15th position is 40, which is greater than 20, my result should look like this:

3) Output File
John1 15.T 40

My PERL knowledge is basic, so I would be obliged to get help from the Monks. Please remember that I cannot store anything in arrays or variables since I have to parse a 3 GB file.

Thanks

Re: Parsing of 3 GB File
by Perlbotics (Archbishop) on Oct 19, 2008 at 09:41 UTC

    Your question raises some more questions like....

    • How big is file1? If the keys fit into memory, a possible solution could be to read all the John1 => "15.T" type keys into a hash and then parse file2 line by line (see the sketch after this list)...
    • Does the order of entries in the output file play any role?
    • May there be multiple matches and how to treat them?
    • Do you count the indexes starting with 0 or 1?
    • What is the significance of the dot+character (.T) after the index (15)? E.g., do they select a different file2?
    • Is this a one-time task? If not, why not start using a database?
    • What have you done so far (show some Perl code or describe your approach)?
    • Since this is a common question, did you use Super Search? You might find something like Searching Huge files.
    • Why are you shouting at Perl? It is mostly harmless ;-)
    • ...
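
    A minimal sketch of that two-pass idea, assuming the record layouts shown in the question and 1-based positions (one of the open questions above); the file names and the %targets hash are illustrative, not taken from the original post:

        use strict;
        use warnings;

        # Pass 1: read file1, remembering name => position pairs. This fits in
        # memory as long as file1's keys are small, even when file2 is 3 GB.
        my %targets;    # e.g. 'John1' => '15.T'
        open my $f1, '<', 'file1.txt' or die "file1: $!";
        while (<$f1>) {
            # ">Harvard 32384743 234394583 John1 15.T"
            my (undef, undef, undef, $name, $loc) = split ' ';
            $targets{$name} = $loc if defined $loc;
        }
        close $f1;

        # Pass 2: stream file2 one record at a time; nothing large is kept.
        open my $f2,  '<', 'file2.txt'  or die "file2: $!";
        open my $out, '>', 'output.txt' or die "output: $!";
        while (<$f2>) {
            my ($name, @scores) = split ' ';
            next unless defined $name and $name =~ s/^>//;   # skip non-records
            next unless exists $targets{$name};
            my ($pos) = $targets{$name} =~ /^(\d+)/;         # '15.T' -> 15
            my $score = $scores[ $pos - 1 ];                 # positions are 1-based
            print $out "$name $targets{$name} $score\n"
                if defined $score and $score >= 20;
        }
        close $out;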
Re: Parsing of 3 GB File
by wfsp (Abbot) on Oct 19, 2008 at 09:42 UTC
    You want to look up a value in the second file based on a key in the first file.

    For the look-up table a good bet would be to load the second file into DBM::Deep.

    You would then simply read a line at a time from the first file, looking up the data you need in the db as you go (a rough sketch follows). hth
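
    A rough sketch of that approach, assuming the file names and record layout from the question; DBM::Deep keeps the hash on disk, so file2's size is not a memory problem:

        use strict;
        use warnings;
        use DBM::Deep;

        # Build the on-disk look-up table once: name => arrayref of scores.
        my $db = DBM::Deep->new('scores.db');
        open my $f2, '<', 'file2.txt' or die "file2: $!";
        while (<$f2>) {
            my ($name, @scores) = split ' ';
            next unless defined $name and $name =~ s/^>//;
            $db->{$name} = \@scores;
        }
        close $f2;

        # Stream file1, consulting the db record by record.
        open my $f1, '<', 'file1.txt' or die "file1: $!";
        while (<$f1>) {
            my (undef, undef, undef, $name, $loc) = split ' ';
            next unless defined $loc and exists $db->{$name};
            my ($pos)  = $loc =~ /^(\d+)/;                 # '15.T' -> 15
            my $score  = $db->{$name}[ $pos - 1 ];
            print "$name $loc $score\n" if defined $score and $score >= 20;
        }
        close $f1;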

Re: Parsing of 3 GB File
by CountZero (Bishop) on Oct 19, 2008 at 14:32 UTC
    A database is the best solution. Dumping the data of file2 into a database is easy: you only need 2 fields per record, name and scores, and you index on the "name" field, so look-up will be fast.

    Then you go through file1 line by line, extract the last 2 fields, and select the record with the name you are looking for. You split the scores field on whitespace and take the item indicated by the last field of the file1 record you are working with (a sketch of this plan follows).
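
    A sketch of that plan using DBI with DBD::SQLite; the file names, table name and column names here are illustrative, not prescribed by anything above:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=scores.sqlite', '', '',
                               { RaiseError => 1, AutoCommit => 1 });

        # Load file2: two fields per record, with an index (PRIMARY KEY) on name.
        $dbh->do('CREATE TABLE IF NOT EXISTS scores (name TEXT PRIMARY KEY, scores TEXT)');
        my $ins = $dbh->prepare('INSERT OR REPLACE INTO scores (name, scores) VALUES (?, ?)');
        $dbh->begin_work;    # one big transaction makes the bulk load much faster
        open my $f2, '<', 'file2.txt' or die "file2: $!";
        while (<$f2>) {
            my ($name, @scores) = split ' ';
            next unless defined $name and $name =~ s/^>//;
            $ins->execute($name, "@scores");
        }
        close $f2;
        $dbh->commit;

        # Walk file1, looking each name up via the index.
        my $sel = $dbh->prepare('SELECT scores FROM scores WHERE name = ?');
        open my $f1, '<', 'file1.txt' or die "file1: $!";
        while (<$f1>) {
            my (undef, undef, undef, $name, $loc) = split ' ';
            next unless defined $loc;
            my ($scores) = $dbh->selectrow_array($sel, undef, $name);
            next unless defined $scores;
            my ($pos)   = $loc =~ /^(\d+)/;
            my $score   = (split ' ', $scores)[ $pos - 1 ];
            print "$name $loc $score\n" if defined $score and $score >= 20;
        }
        close $f1;
        $dbh->disconnect;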

    If you cannot use a database, you may have to index file2 yourself using tell and find the records again with seek (a minimal sketch follows).
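
    A minimal sketch of the tell/seek idea, again with assumed file names; only byte offsets are kept in memory, never the 3 GB of records:

        use strict;
        use warnings;

        # Index pass: remember where each ">name" record starts in file2.
        my %offset;    # name => byte offset of its line
        open my $f2, '<', 'file2.txt' or die "file2: $!";
        my $pos = tell $f2;
        while (my $line = <$f2>) {
            $offset{$1} = $pos if $line =~ /^>(\S+)/;
            $pos = tell $f2;
        }

        # Query pass: jump straight to the record we need.
        open my $f1, '<', 'file1.txt' or die "file1: $!";
        while (<$f1>) {
            my (undef, undef, undef, $name, $loc) = split ' ';
            next unless defined $loc and exists $offset{$name};
            seek $f2, $offset{$name}, 0 or die "seek: $!";   # 0 = SEEK_SET
            my (undef, @scores) = split ' ', scalar <$f2>;
            my ($idx)  = $loc =~ /^(\d+)/;
            my $score  = $scores[ $idx - 1 ];
            print "$name $loc $score\n" if defined $score and $score >= 20;
        }
        close $f1;
        close $f2;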

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Parsing of 3 GB File
by Anonymous Monk on Oct 19, 2008 at 11:46 UTC
    Are these files in .fasta and .qual format? If so, you can use a BioPerl module (Bio::SeqIO) that will read one sequence at a time, so that you can analyze each sequence and drop it from memory afterwards (a sketch follows).
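
    For example, if file2 really is a .qual file, something along these lines might work (untested; assumes BioPerl is installed and that the record IDs match the names in the question):

        use strict;
        use warnings;
        use Bio::SeqIO;

        # Stream the quality file one record at a time; only the current
        # record is held in memory.
        my $in = Bio::SeqIO->new(-file => 'file2.qual', -format => 'qual');
        while (my $qual = $in->next_seq) {
            my $name   = $qual->display_id;      # e.g. 'John1'
            my @scores = @{ $qual->qual };       # the quality values
            # ... look up the wanted position for $name and test >= 20 here ...
        }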
Re: Parsing of 3 GB File
by binf-jw (Monk) on Oct 19, 2008 at 12:36 UTC
    If they are, as the previous post said, in fasta or qual format, then BioPerl is second to none for parsing sequence formats (although it will complain that they don't look like sequences if they contain funny characters; it looks like you're using scores).
    You say: 'I cannot store anything in arrays or variables since I have to parse a 3 GB file.'
    'The possible size of an array is only limited by how much memory you have'. Could you possibly post the code you've tried so we can see how you're storing stuff.
    You may be able to combat the problem by using references.

    Storing refs to nucleotides (an example; I would need to see your code to tailor it better):

        # Shared scalars, one per base; array elements will only hold references.
        my $a_ref = \'A';
        my $c_ref = \'C';
        my $g_ref = \'G';
        my $t_ref = \'T';
        my $n_ref = \'N';   # fallback for anything unrecognised

        # Then storing these values in an array reference:
        my $base_ref;
        while ( my $base = <$fh> ) {    # $fh: an already-open handle on your data
            chomp $base;
            # Get the correct reference for this base
            my $nuc_ref = $base eq 'A' ? $a_ref
                        : $base eq 'C' ? $c_ref
                        : $base eq 'G' ? $g_ref
                        : $base eq 'T' ? $t_ref
                        :                $n_ref;
            push @{$base_ref}, $nuc_ref;
        }

    What is this doing?
    Well, now each element in the array is just a reference to one of ( A, T, C, G, N ) rather than its own copy of the character. See perlreftut for more info*

    You may still have some trouble with the upper limit of arrays etc., but I've read about 10,000 files into a single data structure before without a hitch. It's all about how you do it.
    Update: If I've seriously overlooked something please say.
    If you could post some examples of what you've tried we may be able to streamline it.
    Hope that helps-
    john

    PS: the first post had a good idea about a database.
    * See perlreftut for more information; I'm not sure I can explain it that well.
Re: Parsing of 3 GB File
by hangon (Deacon) on Oct 19, 2008 at 14:58 UTC

    I have to agree with using a database. If you don't have one readily available, just install the DBD::SQLite module. It has the database engine embedded and, unlike database servers such as MySQL, it's an easy no-configuration install.

Re: Parsing of 3 GB File
by JavaFan (Canon) on Oct 19, 2008 at 19:08 UTC
    My PERL knowledge is basic, so I would be obliged to get help from the Monks.

    Is it just that you need Perl-specific help? Or do you not have any clue how to attack this problem, even if you could write it in a language you do master?

    I'd join the people who say that a database may be the best route to take (although I'm willing to change my mind if I learn more details about your situation), but I'd suggest that solution regardless of the language you'd use.

Re: Parsing of 3 GB File
by talexb (Chancellor) on Oct 20, 2008 at 02:02 UTC
      I have 2 files which I need to parse based on certain features, but the files are very big (as much as 3 GB), so I am unable to use arrays or even ordinary variables to hold their contents.

    So if the file is too big for one approach, you need to use another approach. I don't see what the problem is.

    Reading a file into memory is all well and good if you have enough memory, but if it doesn't fit, that hardly exhausts the known constellation of solutions. Obviously, you have to do some sort of paging to disk, either yourself or by using a database. Just go from there.

    Don't let one obstacle stop you from making progress on your problem, whatever it is. Try lots of different approaches, and remember Edison's attitude when someone asked him about his progress in finding a suitable material for the filament of an electric light bulb. He didn't say he'd failed 400 times; he said he'd tried 400 different possibilities and hadn't yet found the correct one.

    Onward!

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

      over 4700 experiments