in reply to Using less memory with BIG files
Extending Moritz’s idea a little further, another trick is to scan the file stem-to-stern once, noting where the “important pieces” begin and end, and what the “key values” are that you will use when searching for those records. Insert the keys into a hash, with a file-position (or a list of file-positions) as the value. Then, after this one sequential pass through the entire file, you can seek() directly to any of those positions at any time thereafter. (If along the way you have noted both the starting position and the size of the entry, you can “slurp” any particular record into, say, a string variable fairly effortlessly.) This is a useful technique to apply to files that are “loosely” structured, as this one seems to be.
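Here is a minimal sketch of that indexing pass in Perl. The file name, and the idea that a record begins with a “>”-prefixed header line, are assumptions for illustration only; adapt the pattern to whatever actually marks a record in your data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %index;    # key => file position of the record's first line

open my $fh, '<', 'big_file.txt' or die "Cannot open: $!";
my $pos = tell $fh;              # position *before* reading each line
while ( my $line = <$fh> ) {
    if ( $line =~ /^>(\S+)/ ) {  # hypothetical record header; $1 is its key
        $index{$1} = $pos;
    }
    $pos = tell $fh;
}

# Later, seek() straight to any record without rescanning the file:
my $key = 'ABC123';              # hypothetical key
if ( defined( my $where = $index{$key} ) ) {
    seek $fh, $where, 0;         # 0 == SEEK_SET, an absolute offset
    my $header = <$fh>;          # read that record's lines from here
    print $header;
}
close $fh;
```

Note that only the keys and offsets live in memory; the records themselves stay on disk until you actually need one.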
Now, if you happen to know that the two files are sorted, and specifically that they are sorted the same way ... if you can positively assert, based on some outside knowledge, that this is true and always will be true of these files ... then your logic becomes a good bit simpler, because you can read the two files sequentially and do everything in just one forward pass, just as they used to do when the only mass-storage device of any reasonable size at your disposal was a tape drive. It would be too messy to sort them yourself, and maybe you do not want to risk that they might be, ahem, “out of sorts,” but it’s a handy trick to use (and, bloody fast ...) when you know that you can.
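For completeness, a rough sketch of that single forward pass in Perl. It assumes each key occurs at most once per file, sorts lexically, and sits (hypothetically) in the first whitespace-delimited column; the file names are placeholders:

```perl
#!/usr/bin/perl
use strict;
use warnings;

open my $a, '<', 'file_a.txt' or die "file_a: $!";
open my $b, '<', 'file_b.txt' or die "file_b: $!";

my $line_a = <$a>;
my $line_b = <$b>;
while ( defined $line_a and defined $line_b ) {
    my ($key_a) = split ' ', $line_a;
    my ($key_b) = split ' ', $line_b;
    if    ( $key_a lt $key_b ) { $line_a = <$a> }   # A is behind; advance A
    elsif ( $key_a gt $key_b ) { $line_b = <$b> }   # B is behind; advance B
    else {                                          # keys match: a joined pair
        print "$key_a:\n$line_a$line_b";            # do whatever you need here
        $line_a = <$a>;
        $line_b = <$b>;
    }
}
close $a;
close $b;
```

Because each file is read exactly once, front to back, memory use stays constant no matter how big the files are: that is the whole appeal of the tape-drive approach.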
Re^2: Using less memory with BIG files
by jemswira (Novice) on Feb 02, 2012 at 14:51 UTC