Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm working with 1.6 gig and larger files. If I read a file in with: @array=<INFILE>; everything bogs down and often freezes beyond my patience. If I read the file in with: while (<INFILE>) { push @array, $_; } all is happy. It still takes a few minutes to load a file, but it works. Does anybody know what is going on here? Ewok
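
For reference, the two approaches being compared look roughly like this; the filename and filehandle are placeholders, since the post doesn't show the surrounding code:

    use strict;
    use warnings;

    my $file = 'big_data.txt';    # placeholder name

    # Approach 1: slurp every line into the array in one list assignment.
    open my $in, '<', $file or die "Can't open $file: $!";
    my @slurped = <$in>;
    close $in;

    # Approach 2: read and push one line at a time.
    open $in, '<', $file or die "Can't open $file: $!";
    my @pushed;
    while (<$in>) {
        push @pushed, $_;
    }
    close $in;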

Replies are listed 'Best First'.
Re: Large file efficiency
by Fletch (Bishop) on Feb 08, 2006 at 01:26 UTC

    The former builds a frelling huge temporary list on the stack and then copies that into @array in one swell foop, while the latter pushes one line at a time.

    Of course a better question might be: why are you loading an entire 1.6G file into RAM? It's more efficient to process things a line at a time (or maybe a record at a time, depending on the structure) if at all possible rather than slurping it all in at once. There may also be things you can do such as writing to a Berkeley DB file or an RDBMS and then process using that instead.
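
    As a rough sketch of that last suggestion (DB_File ships with Perl; the filenames and the idea of keying on the first field are assumptions, since the thread doesn't describe the record format):

        use strict;
        use warnings;
        use DB_File;
        use Fcntl qw(O_RDWR O_CREAT);

        # Tie a hash to an on-disk Berkeley DB file so the data
        # never has to live in RAM all at once.
        tie my %records, 'DB_File', 'records.db', O_RDWR | O_CREAT, 0666, $DB_HASH
            or die "Cannot tie records.db: $!";

        open my $in, '<', 'big_data.txt' or die "Can't open: $!";
        while (my $line = <$in>) {
            chomp $line;
            # Hypothetical format: first whitespace-separated field is a key,
            # the rest of the line is the value.
            my ($key, $rest) = split ' ', $line, 2;
            next unless defined $key;
            $records{$key} = defined $rest ? $rest : '';
        }
        close $in;
        untie %records;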

    But that would take more information about exactly what you're trying to do with your 1.6G.

Re: Large file efficiency
by talexb (Chancellor) on Feb 08, 2006 at 02:17 UTC

    Since the beginning of its existence, Perl has been specifically tuned to work with large files .. but it's still possible to be inefficient when dealing with large files, as you have shown. ;)

    The questions you need to ask yourself are,

    • Do you need to have the entire 1.6G file in memory, or do you only need some of the records?
    • If you need all of the records, can you pick out the parts of each line that you're interested in as the lines go by? (See the sketch below.)
    If you can give us a better idea of what process you're going through, we can probably give you a better idea how to use Perl effectively and efficiently.
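
    For the second question, a minimal sketch of what picking out the interesting parts as the lines go by could look like (the field positions and filename are invented, since we don't know the record layout):

        use strict;
        use warnings;

        my @wanted;
        open my $in, '<', 'big_data.txt' or die "Can't open: $!";
        while (my $line = <$in>) {
            # Keep only the 2nd and 5th whitespace-separated fields,
            # not the whole 1.6G of raw text.
            my @fields = split ' ', $line;
            push @wanted, [ @fields[1, 4] ];
        }
        close $in;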

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Large file efficiency
by BrowserUk (Patriarch) on Feb 08, 2006 at 07:52 UTC

    Leaving aside why you are loading the entire file into memory, some algorithms do require that.

    Based purely upon observation of my system's behaviour:

    1. @array = <INFILE>; needs room for (at least) two copies of the data.

      First the data is placed on Perl's stack. Then the array is allocated and the data is copied into it. Then the stack is freed.

    2. push @array, $_ while <INFILE>; requires 1 copy + a bit.

      The stack only ever holds one line at a time. The array will be grown in stages, with copying required, but ultimately it uses less memory.

    The upshot on my system is that loading a 1 million line/10 MB file using method 1 requires nearly 9 seconds and 125 MB of ram; whereas using method 2 requires under 1.5 seconds and 47 MB of ram.
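
    A minimal sketch of how one might time the two methods on your own system (the test filename is a placeholder, and memory use is easiest to watch externally, e.g. with top, ideally running each method in a separate process):

        use strict;
        use warnings;
        use Time::HiRes qw(time);

        my $file = 'test_1M_lines.txt';    # placeholder test file

        # Method 1: list assignment.
        my $t0 = time;
        open my $in, '<', $file or die "Can't open $file: $!";
        my @a = <$in>;
        close $in;
        printf "list assignment: %.2f s for %d lines\n", time - $t0, scalar @a;
        @a = ();

        # Method 2: push inside a while loop.
        $t0 = time;
        open $in, '<', $file or die "Can't open $file: $!";
        my @b;
        push @b, $_ while <$in>;
        close $in;
        printf "while/push:      %.2f s for %d lines\n", time - $t0, scalar @b;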

    Not definitive, and if your algorithm requires it, it's worth running a simple test on your own system for confirmation, but it seems to me the latter method has no downsides.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Large file efficiency
by Ewok (Initiate) on Feb 08, 2006 at 16:33 UTC
    Since I am working on a cluster, I'm at the mercy of the network for file reads. That means that, usually, one large data read is faster than many, many small reads. Furthermore, since I am pruning data files, I need to adjust the headers, and being rather new at this, the best way seemed to be to slurp in the data, prune out what I don't want, adjust the headers, and write the data out. For files small enough to be kept within my RAM (16G), this method is a factor of 10 faster than reading and writing one line at a time and then going back and adjusting my headers, which was yet another read. However, when the slurp sends me to swap, life slows way down.
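
    A sketch of that slurp/prune/rewrite flow, under an invented layout (one header line carrying a record count, one record per line; the real format and the pruning rule here are assumptions):

        use strict;
        use warnings;

        my ($in_file, $out_file) = ('data.in', 'data.out');    # placeholder names

        open my $in, '<', $in_file or die "Can't read $in_file: $!";
        my @lines = <$in>;              # one large read over the network
        close $in;

        my $header = shift @lines;
        my @kept   = grep { !/^#DISCARD/ } @lines;    # made-up pruning rule

        # Adjust the header so its record count matches the pruned data.
        $header =~ s/\d+/scalar @kept/e;

        open my $out, '>', $out_file or die "Can't write $out_file: $!";
        print {$out} $header, @kept;
        close $out;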
Re: Large file efficiency
by dokkeldepper (Friar) on Feb 08, 2006 at 11:52 UTC
    search: slurp
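
    (That is, search the site for "slurp". The usual core-only slurp idiom reads the whole file into a single scalar rather than an array of lines; the filename below is a placeholder.)

        use strict;
        use warnings;

        my $file = 'big_data.txt';    # placeholder
        open my $in, '<', $file or die "Can't open $file: $!";
        # Localizing $/ (the input record separator) makes <$in> return
        # the entire file contents as one scalar.
        my $content = do { local $/; <$in> };
        close $in;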