Re: What's the most efficient way to write out many lines of data?
by Zaxo (Archbishop) on Jul 09, 2003 at 18:28 UTC
You didn't say whether you slurp the data file into an array, or read it one line at a time. The latter will be faster for large files.
How do you parse the fixed-width records? The unpack function is recommended for speed.
Not a speed issue, but Text::CSV will give accurate CSV records from a list.
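A rough sketch of how those pieces fit together (the field widths and file names below are invented, so adjust them to your layout):
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $template = 'A10 A20 A8';              ## made-up widths; 'A' also trims trailing spaces
my $csv      = Text::CSV->new();

open my $in,  '<', 'fixed.dat'   or die $!;
open my $out, '>', 'records.csv' or die $!;

while ( my $line = <$in> ) {              ## one record at a time, no slurping
    chomp $line;
    $csv->combine( unpack $template, $line ) or die $csv->error_input();
    print {$out} $csv->string(), "\n";
}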
After Compline, Zaxo
Re: What's the most efficient way to write out many lines of data?
by particle (Vicar) on Jul 09, 2003 at 19:14 UTC
this code is 100% untested, but i think it might do something like what you want. i wish i had more time now, but i'll have to leave it to others to comment if you don't understand. sorry again to dump and run, but i hope this helps.
Update: okay, i found a couple minutes to comment the code and test the script. i've updated the code to a working version, and provided sample input and output. let me know if you have any trouble with this. if the comments aren't enough help, we'll be happy to explain what's going on.
#!/usr/bin/perl
use strict;
use warnings;
use 5.006;
$|++;
require Text::CSV; ## for writing a proper CSV file
## get arguments from the commandline
my( $infile, $outfile )= @ARGV;
## set the template for unpacking the database
## see 'perldoc -f pack' for details
my $template= 'A30A30A40';
## verify arguments or display usage
2 == @ARGV
or die "usage: db2csv infile outfile\n";
## open the files, $infile for reading, $outfile for writing
## see 'perldoc perlopentut' for details
open local(*IN) => '<', $infile
or die 'infile:', $!;
open local(*OUT) => '>', $outfile
or die 'outfile:', $!;
## create a CSV object
my $csv= Text::CSV->new();
## set the input record separator to 100 bytes
## see 'perldoc perlvar' for details
local $/= \100;
## process infile a line at a time
while( <IN> )
{
    ## unpack the database record
    ## and create a csv record
    $csv->combine( unpack $template => $_ )
        or die "cannot create record $.:", $csv->error_input() || $!;

    ## print the string to the output file
    print OUT $csv->string(), "\n";
}
__END__
sample input:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBCCCCCCCCCC
+CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCddddddddddddddddddddddddddddddeeeeeeeee
+eeeeeeeeeeeeeeeeeeeeeffffffffffffffffffffffffffffffffffffffff
sample output:
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB","CCC
+CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"
"dddddddddddddddddddddddddddddd","eeeeeeeeeeeeeeeeeeeeeeeeeeeeee","fff
+fffffffffffffffffffffffffffffffffffff"
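assuming you save this as db2csv, an invocation would look something like this (the file names are just examples):
$ perl db2csv fixed.dat records.csv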
~Particle *accelerates*
Re: What's the most efficient way to write out many lines of data?
by flounder99 (Friar) on Jul 09, 2003 at 20:34 UTC
Without seeing your code I don't know for sure, but I think your problem is not Perl but I/O. I created this test program to create a file with 554,152 100-character records. I then reopen the file and split it into 10-character, comma-delimited fields using a regexp, which I thought would be slow.
use Time::HiRes qw/ gettimeofday /;
use strict;
my $starttime = gettimeofday;
open OUTFILE, ">file.txt" or die $!;
for (1 .. 554152) {
    print OUTFILE "X"x100, "\n";
}
close OUTFILE;
print "Creating file took ", gettimeofday - $starttime, " seconds\n";
$starttime = gettimeofday;
open INFILE, "<file.txt" or die $!;
open OUTFILE, ">file1.txt" or die $!;
while (<INFILE>) {
    chomp;
    print OUTFILE join (",", /(.{,10})/g), "\n"
}
print "Splitting file using regexp took ", gettimeofday - $starttime,
+" seconds\n";
__OUTPUT__
Creating file took 8.515625 seconds
Splitting file using regexp took 1.5 seconds
This is on Win2k/Activeperl 806 using a fast P4 with a 10k rpm hard drive and 1Gb ram so the read is probably all from the disk cache. Check your code and make sure you aren't opening the output file for every line. I've seen people do that and slow things to a crawl.
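To illustrate, the anti-pattern looks something like the first loop below (just a sketch, not anyone's actual code), and the fix is simply to hoist the open out of the loop:
#!/usr/bin/perl
use strict;
use warnings;

open INFILE, "<file.txt" or die $!;

## slow: an open() and close() for every single record
while (<INFILE>) {
    open OUTFILE, ">>slow.txt" or die $!;
    print OUTFILE $_;
    close OUTFILE;
}

## fast: open once, print many times, close once
seek INFILE, 0, 0;
open OUTFILE, ">fast.txt" or die $!;
while (<INFILE>) {
    print OUTFILE $_;
}
close OUTFILE;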
-- flounder
Thanks for the interesting example... I had to try it, as the times you quoted seemed very fast. But on my P4 with 512MB RAM and <who knows?> disc speed, your program creates the file in around 3.5 secs!
However, the "converted" file contained only "\n"s.
I'm no guru, but I worked out that (I think) you need a split before the join, or you won't have the list that join requires.
print OUTFILE join (",", (split /(XXXXXXXXXX)/)), "\n";
worked, sort of, for me. I couldn't get your /(.{,10})/ pattern to work, although I think I understand what it's trying to match: runs of up to 10 characters at a time, captured so they come through to the output along with whatever characters are left over (which, in this case, is none). With the original pattern the resulting file simply had "\n"s, as the match never succeeded.
This then took around 20 seconds...
$ perl file.pl
Creating file took 3.40489602088928 seconds
Splitting file using regexp took 19.5581229925156 seconds
The split version results in lines made up of 10 repetitions of ",XXXXXXXXXX,", so that two commas appear between adjacent groups of X's as well as at the beginning and end of each line.
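For what it's worth, the doubled commas come from split keeping the captured delimiter in its return list, with empty strings for the (empty) text between adjacent matches. A small illustration:
my @parts = split /(XXXXXXXXXX)/, "X" x 30;
## @parts is ('', 'XXXXXXXXXX', '', 'XXXXXXXXXX', '', 'XXXXXXXXXX')
print join(",", @parts), "\n";
## prints: ,XXXXXXXXXX,,XXXXXXXXXX,,XXXXXXXXXX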
Oops, that should have been
print OUTFILE join (",", /(.{1,10})/g), "\n"
With that change I got the results:
Creating file took 9.90625 seconds
Splitting file using regexp took 12 seconds
a lot slower but nowhere near 16 minutes.
-- flounder
Re: What's the most efficient way to write out many lines of data?
by bluto (Curate) on Jul 09, 2003 at 18:38 UTC
If possible, make sure both the input and output files are on different physical disks. This alone can sometimes double the throughput. Similarly, avoid using a network filesystem for either file.
bluto
Re: What's the most efficient way to write out many lines of data?
by MrCromeDome (Deacon) on Jul 09, 2003 at 18:20 UTC
We'd be happy to look, but there's not much we can do unless you post a sample of the offending code ;) It sounds fine, but we can't tell you for sure without seeing the goods.
Cheers!
MrCromeDome
Re: What's the most efficient way to write out many lines of data?
by Thelonius (Priest) on Jul 09, 2003 at 21:43 UTC
I have to say that I'm with flounder99 on this. Are your files on network drives, by any chance?
Just for comparison, you could try this (at the shell): time dd conv=unblock cbs=100 if=inputfile of=outputfile
This is the bare minimum conversion: conv=unblock turns each fixed-length cbs-byte record into a newline-terminated line, trimming trailing spaces. When I ran this compared to your Perl code, there was hardly any difference.
Nope...no network drives involved. :)
Re: What's the most efficient way to write out many lines of data?
by sauoq (Abbot) on Jul 09, 2003 at 18:21 UTC
On a test conversion, a file containing 554,152 records took over 16 minutes to complete.
That does sound long (though I have no idea how big the records are). You are unlikely to improve much on the efficiency of print(). You might be able to improve on your conversion, though. Show us the code.
-sauoq
"My two cents aren't worth a dime.";
Sorry for the delay, but I got distracted by work. :) Records tend to be 1300 characters in length and need to be separated into 80+ fields.
I apologize for not posting it before, but here's the code that does the bulk of the work.
while (<IN>) {
    $record = "";
    @values = unpack($template, $_);       ## split the fixed-width record into fields
    foreach $field (@values) {
        $field =~ s/\s+$//;                ## trim trailing spaces from each field
        $record .= "\"" . $field . "\"" . $sep;
    }
    chop($record);                         ## drop the trailing separator
    $record .= "\n";
    print OUT $record;
}
I was able to trim the time down to just under 8 minutes on my 1.6 GHz P4 with ActiveState Perl. The original figure was on a Solaris server running Perl version 5.005_03.
I thought about accumulating several delimited records (say, 100 or so) into a single string to reduce the number of print calls, hoping that would cut the I/O overhead. Anyone know if that will work or if it's just wishful thinking? :)
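In case it helps to picture it, the sort of batching I have in mind would look roughly like this (untested, and convert_record() is just a hypothetical stand-in for the unpack/join work in the loop above):
my $buffer = "";
my $count  = 0;
while (<IN>) {
    $buffer .= convert_record($_);     ## hypothetical wrapper around the loop body above
    if ( ++$count % 100 == 0 ) {       ## flush every 100 records
        print OUT $buffer;
        $buffer = "";
    }
}
print OUT $buffer;                     ## flush whatever is left over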
Many thanks for all the suggestions. :)
Larry Polk
I noticed a couple of things about your code that might help to speed things up a little.
The first is that you are stripping trailing spaces from your fields with a regex. 80+ calls into the regex engine per line is going to be quite expensive, and is probably unnecessary. You don't show us what unpack template you are using, but if you can use the template char 'A' to unpack your fields, then there is no need to take additional steps to trim trailing spaces, as this is done for you. Eg.
print "'$_' " for unpack '(A5)5', 'abcdeabcd abc ab a ';
'abcde' 'abcd' 'abc' 'ab' 'a';
You would need 5.8 in order to use the '(Ann)*' syntax, but using earlier versions of perl, you can achieve the same effect using
my $template = 'A5' x 80;
Also, the way you are building your $record var is less efficient than it could be. Once you have removed the need for the regex, you can more simply CSVify the fields using join, reducing the body of the while loop to
print '"' . join( '","', unpack '(A15)86', $_ ), "\"\n";
This removes the need for the intermediates @values, $record and $field, and for the chop, which should further improve things.
This assumes that your fields don't contain any embedded "s that would need escaping, as your code indicates.
You seem to be running without strict and without using my. It's worth noting that lexical vars are generally faster than globals, although if the above changes are possible there is little need for either.
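Putting those pieces together, the whole filter might look something like this (untested, and the 86 x 'A15' template is only a guess at your actual layout):
#!/usr/bin/perl
use strict;
use warnings;

my $template = 'A15' x 86;    ## works before 5.8; substitute your real field widths

open my $in,  '<', $ARGV[0] or die "infile: $!";
open my $out, '>', $ARGV[1] or die "outfile: $!";

while ( <$in> ) {
    ## 'A' trims the trailing spaces; join quotes and comma-separates in one pass
    print {$out} '"', join( '","', unpack $template, $_ ), "\"\n";
}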
Your idea of accumulating 100 or so lines of output together before printing them is likely to backfire. Given the length of your lines, building up a buffer of 130k in several hundred (80 fields x 100 lines) steps is likely to cause lots of reallocing and copying of memory. NOTE: This is speculation. It may be that perl is clever enough to re-use the largest lump of memory for the second and subsequent lines, but given the stepwise manner in which it would be accumulated, it probably isn't.
It would probably also be worthwhile ensuring that you have buffering turned on for STDOUT. Perl is already quite adept at buffering the output in a fairly optimal fashion, given the chance.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller