phi.jones has asked for the wisdom of the Perl Monks concerning the following question:

I need help speeding up the following Perl code. It pulls a line out of one file, splits it into an array, and uses one of the fields as a filename to open a second filehandle. It then splits that second file into an array and searches it for matches using variables taken from the first file. The second search is taking about 1 minute per loop (the files are around 180 MB). I am using the foreach operator because I am hoping to keep the file in memory while doing the search (there are up to 3000 searches per file). Any ideas gratefully accepted! Thanks, Phil
#!/usr/bin/perl -w
$[ = 1;
$, = ',';
$\ = "\n";

if (! open CDR_DETAILS, "cdr_details.1") {
    die "couldn't open cdr_details.1: $!";
}
open CDR_TDM, ">cdr_tdm.csv";

$XDRFILE = "DUMMY";
while (<CDR_DETAILS>) {
    chomp;
    @ITEM = split(',', $_, 9999);
    $r++;
    print "Record number: $r";
    $i = 0;
    $CDRDATE = $ITEM[1];
    print "CDRDATE: $CDRDATE";
    $CDRTIME = $ITEM[2];
    $CDRTIME =~ s/^0*//g;
    if ($CDRTIME le "0") { $CDRTIME = "0"; }
    print "CDRTIME: $CDRTIME";
    $CLI1 = $ITEM[3];
    print "CLI1: $CLI1";
    $CLI2 = $ITEM[7];
    print "CLI2: $CLI2";
    if ($ITEM[8] ne $XDRFILE) {
        print "$ITEM[8] != $XDRFILE opening file!";
        $XDRFILE = $ITEM[8];
        if (! open XDR, "$XDRFILE") {
            die "couldn't open $XDRFILE: $!";
        }
    }
    foreach (<XDR>) {
        chomp;
        @Fld = split(',', $_, 9);
        if ($Fld[2] eq ${CDRDATE} && $Fld[3] eq ${CDRTIME}
            && $Fld[5] eq ${CLI1} && $Fld[7] eq ${CLI2}) {
            select CDR_TDM;
            print $_;
            select STDOUT;
        }
    }
    seek XDR, 0, 0;
}

Re: Should this be so slow ?
by Roy Johnson (Monsignor) on Feb 10, 2006 at 22:08 UTC
    You're not helping yourself in any way by reading the whole file into memory. It's not clear from your code how you expect to re-use the lines you've read in. The foreach loop will walk through them and they'll be released.

    I'd say read your config file, figure out all the files you're going to have to search and all the terms you're going to have to search for, then read the files line-by-line and do all your searches on each line as you go.
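    A minimal sketch of that two-pass idea (untested; the field positions and the leading-zero handling are assumptions lifted from the original script, shifted to 0-based indexing):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Pass 1: read the details file once and record, per xdr file,
    # the exact field combinations we will need to find in it.
    my %want;
    open my $list, '<', 'cdr_details.1' or die "cdr_details.1: $!";
    while (<$list>) {
        chomp;
        my ($date, $time, $cli1, $cli2, $xdrfile) = (split /,/)[0, 1, 2, 6, 7];
        $time =~ s/^0+//;                 # strip leading zeros as the OP does
        $time = '0' if $time eq '';
        $want{$xdrfile}{ join ',', $date, $time, $cli1, $cli2 } = 1;
    }
    close $list;

    # Pass 2: read each xdr file line by line, once, testing every line
    # with a single hash probe instead of rescanning the whole file.
    open my $out, '>', 'cdr_tdm.csv' or die "cdr_tdm.csv: $!";
    for my $xdrfile (keys %want) {
        open my $xdr, '<', $xdrfile or do { warn "$xdrfile: $!\n"; next };
        while (<$xdr>) {
            chomp;
            my $key = join ',', (split /,/)[1, 2, 4, 6];
            print $out "$_\n" if exists $want{$xdrfile}{$key};
        }
        close $xdr;
    }
    close $out;

    Each 180 MB file is then read exactly once, and each line costs one hash lookup no matter how many search terms there are.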


    Caution: Contents may have been coded under pressure.
Re: Should this be so slow ?
by ikegami (Patriarch) on Feb 11, 2006 at 00:58 UTC
    select CDR_TDM; print $_; select STDOUT;

    is a rather silly way of doing

    print CDR_TDM $_;
Re: Should this be so slow ?
by duff (Parson) on Feb 11, 2006 at 06:51 UTC

    Yes, it should be slow; you're reading 180 MB files 3000 or so times. Is there some ordering relationship you can exploit so that you only have to read each 180 MB file once? My guess is that the files are ordered by date+time. If that's the case, you could do something like this (obviously untested):

    #!/usr/bin/perl
    use warnings;
    use strict;

    $[ = 1;
    $\ = "\n";

    my ($last_xdrfile, $xline) = ('');

    open CDR_DETAILS, "cdr_details.1" or die "couldn't open cdr_details.1: $!\n";
    open CDR_TDM, ">cdr_tdm.csv" or die "couldn't write cdr_tdm.csv: $!\n";

    while (<CDR_DETAILS>) {
        chomp;
        my ($cdrdate, $cdrtime, $cdrcli1, $cdrcli2, $xdrfile) =
            (split /,/)[1, 2, 3, 7, 8];
        $cdrtime += 0;    # numeric context strips leading zeros
        print "Record number: $.";
        print "CDRDATE: $cdrdate";
        print "CDRTIME: $cdrtime";
        print "CLI1: $cdrcli1";
        print "CLI2: $cdrcli2";
        if ($last_xdrfile ne $xdrfile) {
            print "$xdrfile != $last_xdrfile, opening file!";
            open(XDR, $xdrfile) or die "couldn't open $xdrfile: $!\n";
            chomp($xline = <XDR>);    # priming read
        }
        $last_xdrfile = $xdrfile;
        while (defined $xline) {
            my ($xdrdate, $xdrtime, $xdrcli1, $xdrcli2) =
                (split /,/, $xline)[2, 3, 5, 7];
            last if $xdrdate gt $cdrdate;
            last if $xdrtime gt $cdrtime;
            next unless $xdrdate eq $cdrdate && $xdrtime eq $cdrtime
                     && $xdrcli1 eq $cdrcli1 && $xdrcli2 eq $cdrcli2;
            print CDR_TDM $xline;
        }
        continue {
            $xline = <XDR>;
            chomp $xline if defined $xline;
        }
    }
    I'm making some gross assumptions about the formats of the date and time, though. If they're not in a format that's directly comparable, you'll need to convert them to a form that is (possibly using something like Date::Parse).
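    For instance, a sketch of that conversion with Date::Parse's str2time (the combined "date time" format here is an assumption about your data):

    use Date::Parse;    # CPAN module

    # Turn the date+time pair into epoch seconds, which compare
    # numerically with < and >.
    my $epoch = str2time("$cdrdate $cdrtime");
    die "unparseable date/time: $cdrdate $cdrtime" unless defined $epoch;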

    Rather than read the entire 180 MB file into memory, my version reads it a line at a time, assuming both this file and the one generating the "searches" are in date+time order. There are a couple of gyrations to get the XDR file to read properly:

    • we do a priming read as soon as we open the file
    • subsequent reads happen in a continue block (which last skips), so $xline still holds the last line read when we process the next record from the CDR file.
Re: Should this be so slow ?
by graff (Chancellor) on Feb 11, 2006 at 16:50 UTC
    To expand a bit on Roy Johnson's suggestion, you could read the entire "cdr_details.1" file at the beginning, and build a hash-of-arrays (HoA): the hash is keyed by the "xdr" file names to be searched, and each hash element is an array of patterns to search for in that file.

    Once the HoA structure is filled, loop over the hash keys (file names to open), and as you read each line from the current file, loop over the search patterns and output the current line if there's a match.

    Something like this:

    use strict;
    use warnings;

    my %search;
    open( LIST, "cdr_details.1" ) or die "cdr_details.1: $!";
    while (<LIST>) {
        chomp;
        my @terms = (split /,/, $_, -1 )[0,1,2,6,7];
        my $xdrfile = pop @terms;    # file to search is last term
        $terms[1] =~ s/^0+//;        # mirror the OP's leading-zero strip on the time field
        $terms[1] = '0' if $terms[1] eq '';
        push @{$search{$xdrfile}}, join( "\0", @terms );
        # save remaining terms as a null-byte-separated string;
        # multiple strings are pushed into an array for each xdr file
    }
    close LIST;

    open( OUT, ">cdr_tdm.csv" ) or die "cdr_tdm.csv: $!";
    for my $xdrfile ( sort keys %search ) {
        my @findsets = @{$search{$xdrfile}};
        open( XDR, $xdrfile ) or do { warn "$xdrfile: $!\n"; next };
        while (<XDR>) {
            chomp;
            my $fldset = join( "\0", (split /,/, $_, -1 )[1,2,4,6] );
            # $fldset is a null-byte-separated string that could match findsets
            for my $findset ( @findsets ) {
                if ( $fldset eq $findset ) {
                    print OUT "$_\n";    # add back the newline removed by chomp
                    last;
                }
            }
        }
        close XDR;
    }
    That will output lines in the order of the xdr file names that contain them, rather than the order of the list of fields to search for (i.e. the ordering in your cdr_details file).

    If you want the "cdr_tdm" list sorted some other way, just sort that file after this script is done writing it. (The unix "sort" command is good for that, though a perl script to do the same thing would be pretty simple as well.)
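    For instance, a minimal sketch of such a perl sort, assuming the output should be ordered by its date and time fields (taken here to be the second and third columns, which is an assumption):

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $in, '<', 'cdr_tdm.csv' or die "cdr_tdm.csv: $!";
    my @lines = <$in>;
    close $in;

    # Schwartzian transform: decorate each line with its date and time
    # fields, sort on them, then strip the decoration.
    my @sorted = map  { $_->[0] }
                 sort { $a->[1] cmp $b->[1] or $a->[2] cmp $b->[2] }
                 map  { [ $_, (split /,/)[1, 2] ] } @lines;

    open my $out, '>', 'cdr_tdm.sorted.csv' or die "cdr_tdm.sorted.csv: $!";
    print $out @sorted;
    close $out;

    That assumes the date and time strings compare correctly as text; otherwise convert them first, as noted in duff's reply above.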