in reply to Re: Re: What's the most efficient way to write out many lines of data?
in thread What's the most efficient way to write out many lines of data?

I noticed a couple of things about your code that might help to speed things up a little.

The first is that you are stripping trailing spaces from your fields with a regex. 80+ calls into the regex engine per line are going to be quite expensive, and are probably unnecessary. You don't show us what unpack template you are using, but if you can use the template char 'A' to unpack your fields, then there is no need to take additional steps to trim trailing spaces, as this is done for you. E.g.

print "'$_' " for unpack '(A5)5', 'abcdeabcd abc ab a '; 'abcde' 'abcd' 'abc' 'ab' 'a';

You would need perl 5.8 in order to use the '(Ann)*' group syntax; with earlier versions of perl you can achieve the same effect using

my $template = 'A5' x 80;
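which, applied to the five-field example above, should come out the same:

my $template = 'A5' x 5;
print "'$_' " for unpack $template, 'abcdeabcd abc  ab   a    ';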

Also, the way you are building your $record var is less efficient than it could be. Once you have removed the need for the regex, you can CSVify the fields more simply using join, reducing the body of the while loop to

print '"' . join( '","', unpack '(A15)86', $_ ), "\"\n";

This removes the need for the intermediates @values, $record and $field, and for a chop, which should further improve things.
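Putting the pieces together, the whole loop might look something like this. This is just a sketch: it assumes 86 fifteen-character fields per record (per the unpack template above) and input arriving on STDIN.

use strict;
use warnings;

while ( my $line = <STDIN> ) {
    # 'A' trims the trailing spaces for us; the newline falls
    # outside the 86x15 template, so no chomp is needed
    print '"', join( '","', unpack( '(A15)86', $line ) ), "\"\n";
}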

This assumes that your fields don't contain any embedded "s that would need escaping, which is what your code indicates.
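If they did, the usual CSV fix is to double any embedded quotes, though note that this reintroduces a per-field substitution, so only add it if you have to. A sketch:

my @fields = unpack '(A15)86', $_;
s/"/""/g for @fields;    # double embedded quotes, CSV-style
print '"', join( '","', @fields ), "\"\n";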

You seem to be running without strict and without using my. It is worth noting that lexical vars are generally faster than globals, although if the above changes are possible, that pretty much removes the need for either.

Your idea of accumulating 100 or so lines of output before printing them is likely to backfire. Given the length of your lines, building up a buffer of 130k in several thousand (80+ fields x 100 lines) steps is likely to cause lots of reallocing and copying of memory. NOTE: This is speculation. It may be that perl is clever enough to re-use the largest lump of memory for the second and subsequent lines, but given the stepwise manner in which it would be accumulated, it probably isn't.
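That said, if you do want to experiment with batching, one way to sidestep the stepwise reallocs is to pre-grow the scalar once; assigning it the empty string empties it without handing the allocation back. A speculative sketch (the buffer size and batch count are guesses):

my $buffer = ' ' x 131_072;    # grow the scalar's internal buffer once
$buffer = '';                  # empty it; perl keeps the allocation

my $count = 0;
while ( my $line = <STDIN> ) {
    $buffer .= '"' . join( '","', unpack '(A15)86', $line ) . "\"\n";
    if ( ++$count % 100 == 0 ) {
        print $buffer;
        $buffer = '';
    }
}
print $buffer;                 # flush whatever is left over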

It would probably be worthwhile ensuring that you have buffering turned on for STDOUT. Given the chance, perl is already quite adept at buffering output in a fairly optimal fashion.
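In practice that just means leaving autoflush off, which is the default:

$| = 0;    # autoflush off: STDOUT stays buffered
           # (perl block-buffers STDOUT when it isn't a terminal)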


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: Re: Re: Re: What's the most efficient way to write out many lines of data?
by Anonymous Monk on Jul 11, 2003 at 15:49 UTC

    Thanks a million for your input. I actually thought the regular expression to trim spaces might be a culprit, but I did not know that unpack would automatically trim trailing spaces. All the data I'm splitting is character data, so A should work just fine (if I understand unpack correctly). :) I do have to ask, though: does unpack trim leading spaces too? That could be a problem, since some of the data in the fields is positional.
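    I suppose I can check that myself with a quick test, something like (if I have the semantics right, 'A' should only strip trailing whitespace):

    print "<", unpack( 'A5', '  ab ' ), ">\n";    # should print "<  ab>"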

    Put your speculation to rest...accumulating the records into a string and then writing in bulk does indeed backfire. It slows the process down...not much, but it does impact it negatively.

    I was fairly certain Perl is optimized to buffer output efficiently, but I'm testing with 500K records and I need to really use this script on millions of records...so I want to get the time down as much as possible. :)

    Thanks again for all the suggestions...I'll keep you posted as I implement your suggestions. :)

    --Larry