
This is an update on my investigations regarding using ':raw' and sysread(). It is also an apology to most of the monks. The question I have been pursuing for 3 days is:

Why did I consistently see such dramatic performance improvements with each of the steps outlined in my root node when (apparently) the one or two others who tried this out failed to see similar improvements?

I have pursued various avenues, including memory allocation/deallocation and mismatched C-runtime libraries/.DLLs, and have completely blown away various non-standard compilers and other tools and finally re-installed perl 5.8.2.

Despite all this, I still see the same dramatic performance increases on my system from each of the steps outlined.

Finally, I know why!

The reason is newlines--or rather, the absence of them.

When I generated my test data, the 3GB file, I did it using a one-liner that generated random-length strings (max. 10,000, as mentioned in the original post upon which I based the idea) of random sequences of A, C, G & T, and kept printing new random lines until the accumulated lengths totalled 3GB+.

Hardly efficient, but it was a one-off (all smaller test files were simply head -c 10485760 3GB.dat > 10MB.dat etc.), easy to type and I was going to watch a movie while it ran anyway.

However, it appears that I omitted one thing: my customary -l. That meant that none of my test files contained a single newline, and that explains everything.

And so, whilst using ':raw' and sysread do indeed provide some fairly beneficial performance improvements (used correctly), the level of those improvements is far less dramatic than my original post showed, if the file does contain newlines.

And so, I apologise to the community of PerlMonks for this misinformation.

My sincerest apologies, BrowserUk.

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

In reply to Re: Optimising processing for large data files. (Apology and explaination) by BrowserUk
in thread Optimising processing for large data files. by BrowserUk
