Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: Optimising processing for large data files. (Apology and explaination)

by BrowserUk (Patriarch)
on Apr 14, 2004 at 09:01 UTC ( [id://344959]=note: print w/replies, xml ) Need Help??


in reply to Optimising processing for large data files.

This is an update on my investigations regarding using ':raw' and sysread(). It is also an apology to most of the monks. The question I have been pursuing for 3 days is:

Why did I consistantly see such dramatic performance improvements with each of the steps outlined in my root node when (apparently) one or two others that tried this out, failed to see similar improvements?

Having persued various avenues including memory allocation/deallocation, mismatched C-runtime libraries/.DLLs, and having completely blown away various non-standard compilers and other tools and finally re-installed perl 5.8.2.

Despite all this, I still see the same dramatic performance increases on my system from each of the steps outlined.

Finally, I know why!

The reason is newlines--or rather, the absence of them.

When I generated my testdata, the 3GB file, I did it using a one-liner than generated random length strings (max. 10,000 as mentioned in the original post upon which I based the idea) of random sequences of A, C, G & T. and kept printing new random lines until the accumulated lengths totalled 3GB+.

Hardly efficient, but it was a one-off (all smaller test files were simply head -c 10485760 3GB.dat > 10MB.dat etc.), easy to type and I was going to watch a movie while it ran anyway.

However, it appears that I omitted one thing, my customary -l. Which meant that none of my test files contained a single newline. And that explains everything.

And so, whilst using ':raw' and sysread do indeed provide some fairly beneficial performance improvements (used correctly), the level of those improvements is far less dramatic than my original post showed, if the file does contain newlines.

And so, I apologies to the community of perlmonks for this misinformation.

My sincerest apologies, BrowserUk.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://344959]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (3)
As of 2024-03-29 06:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found