in reply to Looking for ways to speed up the parsing of a file...

I'm wondering if 10 minutes to process 6 GB of data might not be all that bad -- maybe there's not all that much room for improvement. Have you checked how long it takes to run a script like this on that sort of data file?
while (<>) { @_ = split; $wc += @_ }
print "counted $wc white-space-delimited tokens\n";
That would show you the limit on how quickly the file could be processed using perl.

In any case, I'd be more inclined to look for ways to economize on the amount of code being written to accomplish the task. One thing that might simplify the logic a lot would be to do record-oriented input (rather than line-oriented input):

open (NETSTATS,"$input_file");
$/ = "\nnet '";

my @field_names = ( 'wire capacitance', 'wire resistance', 'number of loads',
                    'number of pins', 'total wire length' );

while (<NETSTATS>) {
    chomp;  # removes $/ from end of string
    warn "got record # $.\n" if ( $. % 50000 == 0 );  # this belongs on STDERR, IMO
    next unless (/^([^']+)'/);
    $NetName = $1;
    my %field_val = ();
    for my $field ( @field_names ) {
        if ( /$field:\s+([\d.]+)/ ) {
            $field_val{$field} = $1;
        }
    }
    # ....
}
(updated the "next unless" line, and added assignment to $NetName, as per OP)

I'm not going to try reimplementing the whole thing, but just that little snippet should give you the basic idea of how I would go about it. The part I've shown replaces approximately the first 50 lines of code from the OP. As for the rest, instead of testing a bunch of distinct scalar variables to decide what to do with the record data, you would check the keys and values of a hash, which can be done with less code.
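For example, here's a minimal sketch of what I mean (the fanout-indexed @NetStats array and the push are borrowed from the OP's code; the grep stands in for the long chain of ne "" / ne "NaN" tests and the $c counter):

# did every field get captured?  (replaces the separate $NetCap/$NetRes/... checks)
if ( @field_names == grep { defined $field_val{$_} } @field_names ) {
    push @{ $NetStats[ $field_val{'number of loads'} ] ||= [] },
         [ $NetName, @field_val{ @field_names } ];
}

Tracking another field then becomes a one-entry change to @field_names rather than another scalar and another test.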

Even if it ends up running a little slower than the original (though I doubt it would), there are other advantages in terms of clarity and maintainability of the code.

And with this sort of approach, it might be easier to find tricks that will speed it up -- e.g. the regex matches in the for loop might be quicker if done like this (because with each iteration, $_ becomes shorter, and the target string is near the beginning):

for my $field ( @field_names ) {
    if ( s/.*?\s$field:\s+([\d.]+)//s ) {
        $field_val{$field} = $1;
    }
}
The main point is that by reading the data one whole record at a time, the logic becomes a lot easier (and might end up running faster, as well).

Re^2: Looking for ways to speed up the parsing of a file...
by fiddler42 (Beadle) on May 18, 2008 at 02:28 UTC
    Hi, All,

    Thanks for all of the great suggestions. I have reduced the processing time by a whopping 40% after rolling several of them in. The updated code is below.

    I don't know how to parallel process something like this. I think splitting the file will be too time-consuming. (I am running on a system with 32 CPUs, though, so it is tempting.) I think the file sizes will have to be much bigger before I consider that.

    Again, thanks everyone!!

    Fiddler42

    open (NETSTATS,"$input_file");
    $TotalNets = 0;
    while (<NETSTATS>) {
      if (/^net \'(.*)'\:\s*$/) {
        $NetName = $1;
        $c = 1;
        $TotalNets++;
        if ($TotalNets % 100_000 == 0 && $TotalNets > 0) {
          print ("Parsed $TotalNets nets...\n");
        }
        do {
          if (/^\s+wire capacitance\:\s+(\d+.*\d*)\s*$/) {
            $NetCapRaw = $1;
            $NetCap = $CapMultiplier*$NetCapRaw;
            $c++;
          } elsif (/^\s+wire resistance\:\s+(\d+.*\d*)\s*$/) {
            $NetRes = $1;
            $c++;
          } elsif (/^\s+number of loads\:\s+(\d+)\s*$/) {
            $NetFanout = $1;
            $c++;
          } elsif (/^\s+total wire length\:\s+(\d+.*\d*)\s*/) {
            $NetLength = $1;
            $c++;
          }
          $_ = <NETSTATS>;
        } until ((/Driver Pins/) || ($_ eq ""));
        if (/Driver Pins/) {
          $_ = <NETSTATS>;
          $_ = <NETSTATS>;
          ($FirstDriver) = $_ =~ /^\s*(\S.*)\s*/;
          $c++;
        }
        $AddToCustomTable = 0;
        if (($NetName ne "") && (($NetCap ne "") && ($NetCap ne "NaN"))
            && (($NetRes ne "") && ($NetRes ne "NaN"))
            && (($NetFanout ne "") && ($NetFanout ne "NaN"))
            && (($NetLength ne "") && ($NetLength ne "NaN"))
            && ($FirstDriver ne "") && ($c == 6)) {
          if ($NetFanout <= $UpperFanoutLimitOfTable) {
            if (($UseNetPattern == 0) && ($UseDriverCell == 0) && ($TopLevelOnly == 0)) {
              $AddToCustomTable = 1;
            } elsif (($UseNetPattern == 0) && ($UseDriverCell == 0) && ($TopLevelOnly == 1)) {
              $DriverForwardSlashCount = $FirstDriver =~ s/(\/)/$1/gs; # Simple command to count characters...
              $NetNameForwardSlashCount = $NetName =~ s/(\/)/$1/gs;
              if (($DriverForwardSlashCount <= 1) && ($NetNameForwardSlashCount <= 1)) { $AddToCustomTable = 1; }
              if ($DebugMode == 1) {
                print ("Adding net $NetName (driver = $FirstDriver)...\n");
                print DEBUG_VERBOSE ("$NetFanout $NetRes\n");
              }
            } elsif (($UseNetPattern == 0) && ($UseDriverCell == 1) && ($TopLevelOnly == 0)) {
              if ($FirstDriver =~ qr/$DriverPattern/x) { $AddToCustomTable = 1; } # to regard variable as a regular expression...
            } elsif (($UseNetPattern == 1) && ($UseDriverCell == 0) && ($TopLevelOnly == 0)) {
              if ($NetName =~ qr/$NetPattern/x) { $AddToCustomTable = 1; }
            } elsif (($UseNetPattern == 1) && ($UseDriverCell == 1) && ($TopLevelOnly == 0)) {
              if ($NetName =~ qr/$NetPattern/x) {
                $AddToCustomTable = 1;
              } elsif ($FirstDriver =~ qr/$DriverPattern/x) { $AddToCustomTable = 1; }
            }
            # These conditions are not allowed per input argument parsing...
            #} elsif (($UseNetPattern == 0) && ($UseDriverCell == 1) && ($TopLevelOnly == 1)) {
            #} elsif (($UseNetPattern == 1) && ($UseDriverCell == 0) && ($TopLevelOnly == 1)) {
            #} elsif (($UseNetPattern == 1) && ($UseDriverCell == 1) && ($TopLevelOnly == 1)) {
          }
          if ($AddToCustomTable == 1) {
            push @{$NetStats[$NetFanout] ||= []},
                 [ $NetName, $NetCap, $NetRes, $NetLength, $FirstDriver ];
          }
        } else {
          if ($DebugMode == 1) {
            print DEBUG_VERBOSE ("ERROR: Problem deriving stats for net $NetName!\n");
            print DEBUG_VERBOSE ("ERROR: c=$c NetName=$NetName NetFanout=$NetFanout NetCap=$NetCap NetRes=$NetRes NetLength=$NetLength FirstDriver=$FirstDriver\n\n");
          }
        }
      }
      $NetName = "";
      $NetCap = "NaN";
      $NetRes = "NaN";
      $NetFanout = "NaN";
      $NetLength = "NaN";
      $FirstDriver = "";
    }
    print ("Parsed $TotalNets nets...\n\n");
    close (NETSTATS);
      32 CPUs, wow! If you're interested, I think splitting the file should be very easy. To split it N ways I would (see the sketch after the list):

      1. $start = 0
      2. seek() forward int($size / $N).
      3. Search forward for the next "^net" delimiter line, capturing start position in $here.
      4. Write out the chunk from $start to $here to a file "chunk.$i", where $i is the chunk index.
      5. Set $start = $here
      6. Loop to 2 until done.
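
      A rough, untested sketch of those steps -- the filename and $N are just placeholders, and the scan assumes each record starts with a line beginning "net ":

      use strict;
      use warnings;

      my $N          = 8;                      # number of chunks -- tune to your CPUs/disks
      my $input_file = 'net_stats.txt';        # placeholder name
      my $size       = -s $input_file;

      open my $in, '<', $input_file or die "can't open $input_file: $!";

      my $start = 0;
      for my $i ( 1 .. $N ) {
          my $here;
          if ( $i == $N ) {
              $here = $size;                   # last chunk runs to end of file
          }
          else {
              # 2. seek forward int($size / $N) from the start of this chunk
              seek $in, $start + int( $size / $N ), 0 or die "seek: $!";
              <$in>;                           # discard the (probably partial) line we landed in
              # 3. find the next record boundary
              while (<$in>) {
                  if (/^net /) {
                      $here = tell($in) - length($_);
                      last;
                  }
              }
              $here = $size unless defined $here;
          }

          # 4. copy bytes $start .. $here into chunk.$i
          open my $out, '>', "chunk.$i" or die "can't write chunk.$i: $!";
          seek $in, $start, 0;
          my $want = $here - $start;
          while ( $want > 0 ) {
              my $len = $want < 1_048_576 ? $want : 1_048_576;
              read( $in, my $buf, $len ) or last;
              print $out $buf;
              $want -= length $buf;
          }
          close $out;

          # 5./6. next chunk starts where this one ended
          $start = $here;
      }
      close $in;

      Each chunk.$i can then be handed to a separate copy of the parser.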

      Whether this is an overall win (it will take time) depends a lot on your disks. You'll have to tune $N to match the number of CPUs you can keep fed with data - set it too high and you'll go slower as your CPUs compete for disk access and slow each other down.

      -sam

        Instead of actually splitting the file into several additional files, you could just determine the positions as you describe, then work on the different parts by seeking to the right position before starting your processing loop. For example, you could determine the start and end position and then fork() off a new process to work on that chunk. Reading from multiple files (or different places in the same file) in parallel might end up being less efficient from an I/O perspective, though, as it could require the drive to seek a lot more. So you'd need to experiment a bit to find the right way to parallelize this.
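
        A bare-bones, untested sketch of that fork()/seek() variant -- the @offsets boundaries would come from the scanning step described above, and each child writes its output to its own file, since a forked child can't hand a Perl data structure back to the parent:

        use strict;
        use warnings;

        my $input_file = 'net_stats.txt';                     # placeholder name
        my @offsets    = ( 0, 2_000_000_000, 4_000_000_000,   # placeholder chunk boundaries
                           -s $input_file );

        my @pids;
        for my $i ( 0 .. $#offsets - 1 ) {
            my ( $start, $end ) = @offsets[ $i, $i + 1 ];
            my $pid = fork();
            die "fork failed: $!" unless defined $pid;
            if ( $pid == 0 ) {                                # child: work on one chunk
                open my $fh, '<', $input_file or die "can't open $input_file: $!";
                seek $fh, $start, 0 or die "seek: $!";
                open my $out, '>', "results.$i" or die "can't write results.$i: $!";
                while (<$fh>) {
                    last if tell($fh) > $end;                 # past the end of this chunk
                    # ... the existing parsing loop goes here, writing to $out ...
                }
                close $fh;
                close $out;
                exit 0;
            }
            push @pids, $pid;                                 # parent: keep forking
        }
        waitpid $_, 0 for @pids;                              # then wait for all the children
        # ... merge the results.$i files here ...

        Whether the children read line-by-line or record-by-record (with $/ set as above) doesn't matter to the forking; the only requirement is that the chunk boundaries fall on record starts.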