fiddler42 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, All,

I need to parse some very large files that are formatted like so:-

net 'IR_REG_INST_INT[20]':
        dont_touch:          FALSE
        pin capacitance:     0.00458335
        wire capacitance:    0.00103955
        total capacitance:   0.0056229
        wire resistance:     0.0663061
        number of drivers:   1
        number of loads:     2
        number of pins:      3
        total wire length:   9.20 (Routed)
                X_length = 0.96, Y_length = 8.24
        number of vias:      6

Connections for net 'IR_REG_INST_INT[20]':

  Driver Pins     Type                   Pin Cap      Pin Loc
  ------------    ----------------       --------     --------
  U195/o          Output Pin (invx20)    0.00162106   [1.12 409.88]

  Load Pins       Type                   Pin Cap      Pin Loc
  ------------    ----------------       --------     --------
  U196/c          Input Pin (and3x10)    0.00161077   [1.131 401.15]
  U1460/a         Input Pin (or2x05)     0.00135152   [1.68 409.22]

I have several million summaries like the above in one giant file (c. 6 Gig in size). I need to collect pertinent details from each "net" summary. Here is exactly what I am doing:-

open (NETSTATS,"$input_file");
$TotalNets = 0;
while (<NETSTATS>) {
    if ($_ =~ /^net \'/) {
        ($NetName) = $_ =~ /^net \'(.*)'\:\s*$/;
        $c = 1;
        $TotalNets++;
        if (($TotalNets == 50000)   || ($TotalNets == 100000)  ||
            ($TotalNets == 250000)  || ($TotalNets == 500000)  ||
            ($TotalNets == 1000000) || ($TotalNets == 1500000) ||
            ($TotalNets == 2000000) || ($TotalNets == 3000000)) {
            print ("Parsed $TotalNets nets...\n");
        }
        do {
            if ($_ =~ /wire capacitance/) {
                if ($_ =~ /^\s+wire capacitance\:\s+\d.*\d\s*$/) {
                    ($NetCapRaw) = $_ =~ /^\s+wire capacitance\:\s+(\d.*\d)\s*$/;
                    $NetCap = $CapMultiplier*$NetCapRaw;
                    $c++;
                }
                else {
                    $NetCap = "NaN";
                }
            }
            if ($_ =~ /wire resistance/) {
                if ($_ =~ /^\s+wire resistance\:\s+\d.*\d\s*$/) {
                    ($NetRes) = $_ =~ /^\s+wire resistance\:\s+(\d.*\d)\s*$/;
                    $c++;
                }
                else {
                    $NetRes = "NaN";
                }
            }
            if ($_ =~ /number of loads/) {
                if ($_ =~ /^\s+number of loads\:\s+\d+\s*$/) {
                    ($NetFanout) = $_ =~ /^\s+number of loads\:\s+(\d+)\s*$/;
                    $c++;
                }
                else {
                    $NetFanout = "NaN";
                }
            }
            if ($_ =~ /total wire length/) {
                if ($_ =~ /^\s+total wire length\:\s+\d.*\d\s*/) {
                    ($NetLength) = $_ =~ /^\s+total wire length\:\s+(\d.*\d)\s*/;
                    $c++;
                }
                else {
                    $NetLength = "NaN";
                }
            }
            $_ = <NETSTATS>;
        } until (($_ =~ /Driver Pins/) || ($_ eq "" ));
        if ($_ =~ /Driver Pins/) {
            $_ = <NETSTATS>;
            $_ = <NETSTATS>;
            $_ =~ s/^ *//;
            $_ =~ s/ *$//;
            @DriverLine = split (/\s+/,$_);
            $FirstDriver = $DriverLine[0];
            $c++;
        }
        $AddToCustomTable = 0;
        if (($NetName ne "") &&
            (($NetCap ne "") && ($NetCap ne "NaN")) &&
            (($NetRes ne "") && ($NetRes ne "NaN")) &&
            (($NetFanout ne "") && ($NetFanout ne "NaN")) &&
            (($NetLength ne "") && ($NetLength ne "NaN")) &&
            ($FirstDriver ne "") && ($c == 6)) {
            if ($NetFanout <= $UpperFanoutLimitOfTable) {
                if (($UseNetPattern == 0) && ($UseDriverCell == 0) && ($TopLevelOnly == 0)) {
                    $AddToCustomTable = 1;
                }
                if (($UseNetPattern == 0) && ($UseDriverCell == 0) && ($TopLevelOnly == 1)) {
                    $DriverForwardSlashCount = $FirstDriver =~ s/(\/)/$1/gs; # Simple command to count characters...
                    $NetNameForwardSlashCount = $NetName =~ s/(\/)/$1/gs;
                    if (($DriverForwardSlashCount == 0) && ($NetNameForwardSlashCount == 0)) {
                        $AddToCustomTable = 1;
                        if ($DebugMode == 1) {
                            print ("Adding net $NetName (driver = $FirstDriver)...\n");
                            print DEBUG_VERBOSE ("$NetFanout $NetRes\n");
                        }
                    }
                    if (($DriverForwardSlashCount == 0) && ($NetNameForwardSlashCount == 1)) {
                        $AddToCustomTable = 1;
                        if ($DebugMode == 1) {
                            print ("Adding net $NetName (driver = $FirstDriver)...\n");
                            print DEBUG_VERBOSE ("$NetFanout $NetRes\n");
                        }
                    }
                    if (($DriverForwardSlashCount == 1) && ($NetNameForwardSlashCount == 0)) {
                        $AddToCustomTable = 1;
                        if ($DebugMode == 1) {
                            print ("Adding net $NetName (driver = $FirstDriver)...\n");
                            print DEBUG_VERBOSE ("$NetFanout $NetRes\n");
                        }
                    }
                    if (($DriverForwardSlashCount == 1) && ($NetNameForwardSlashCount == 1)) {
                        $AddToCustomTable = 1;
                        if ($DebugMode == 1) {
                            print ("Adding net $NetName (driver = $FirstDriver)...\n");
                            print DEBUG_VERBOSE ("$NetFanout $NetRes\n");
                        }
                    }
                }
                if (($UseNetPattern == 0) && ($UseDriverCell == 1) && ($TopLevelOnly == 0)) {
                    if ($FirstDriver =~ qr/$DriverPattern/x) { # to regard variable as a regular expression...
                        $AddToCustomTable = 1;
                    }
                }
                #if (($UseNetPattern == 0) && ($UseDriverCell == 1) && ($TopLevelOnly == 1)) {
                #    This condition not allowed per input argument parsing...
                #}
                if (($UseNetPattern == 1) && ($UseDriverCell == 0) && ($TopLevelOnly == 0)) {
                    if ($NetName =~ qr/$NetPattern/x) {
                        $AddToCustomTable = 1;
                    }
                }
                #if (($UseNetPattern == 1) && ($UseDriverCell == 0) && ($TopLevelOnly == 1)) {
                #    This condition not allowed per input argument parsing...
                #}
                if (($UseNetPattern == 1) && ($UseDriverCell == 1) && ($TopLevelOnly == 0)) {
                    if ($NetName =~ qr/$NetPattern/x) {
                        $AddToCustomTable = 1;
                    }
                    elsif ($FirstDriver =~ qr/$DriverPattern/x) {
                        $AddToCustomTable = 1;
                    }
                }
                #if (($UseNetPattern == 1) && ($UseDriverCell == 1) && ($TopLevelOnly == 1)) {
                #    This condition not allowed per input argument parsing...
                #}
            }
            if ($AddToCustomTable == 1) {
                #use constant (p_name => 0, p_cap => 1, p_res => 2, p_length => 3, p_driver => 4);
                push @{$NetStats[ $NetFanout ] ||= []}, [ $NetName, $NetCap, $NetRes, $NetLength, $FirstDriver ];
            }
        }
        else {
            if ($DebugMode == 1) {
                print DEBUG_VERBOSE ("ERROR: Problem deriving stats for net $NetName!\n");
                print DEBUG_VERBOSE ("ERROR: c=$c NetName=$NetName NetFanout=$NetFanout NetCap=$NetCap NetRes=$NetRes NetLength=$NetLength FirstDriver=$FirstDriver\n\n");
            }
        }
    }
    $NetName = "";
    $NetCap = "NaN";
    $NetRes = "NaN";
    $NetFanout = "NaN";
    $NetLength = "NaN";
    $FirstDriver = "";
}
print ("Parsed $TotalNets nets...\n\n");
close (NETSTATS);

...but I would really like to speed up the parsing of the input file. Does anyone have any general suggestions? It currently takes about 10 minutes to process the file, and I would like to pull that in a little. I know I am kinda splitting hairs here (10 minutes is reasonable, after all), but, again, any little improvement here and there would be much appreciated.

Thanks,

Fiddler42

Replies are listed 'Best First'.
Re: Looking for ways to speed up the parsing of a file...
by samtregar (Abbot) on May 17, 2008 at 19:02 UTC
    If you want to speed up your code you have to know where you're spending your time. The best tool for this job is a profiler. Usually I would suggest Devel::DProf but you don't have any subroutines. Something line-oriented like Devel::SmallProf might be better. Once you know where your program is spending most of its time you can focus on improving that part.

    One thing that jumps out at me skimming your code is this pattern:

    if ($_ =~ /^\s+total wire length\:\s+\d.*\d\s*/) {
        ($NetLength) = $_ =~ /^\s+total wire length\:\s+(\d.*\d)\s*/;

    There's no need to use two regexes here, and also no need to be explicit about $_ (it's the default match target). You can just do:

    if (/^\s+total wire length\:\s+(\d.*\d)\s*/) {
        $NetLength = $1;

    That should shave off a bit of time.

    Another possibility to consider that profiling won't immediately suggest is parallelization. Do you have multiple CPUs or multiple machines at your disposal? If so, divide the file into pieces and hand them off to multiple processes to work on simultaneously.

    -sam

      The OP also wants to test if the text following the colon is a numeric value or something else. It still can be done with one regex, though:
      if (/^\s+total wire length\:\s+(\d.*\d)?/) {
          if (defined($1)) {
              $NetLength = $1;
              $c++;
          }
          else {
              $NetLength = "NaN";
          }
      }
Re: Looking for ways to speed up the parsing of a file...
by pc88mxer (Vicar) on May 17, 2008 at 18:57 UTC
    Not that you shouldn't try to find a faster or cleaner way to do this, but I'm of the opinion that correctness trumps speed. After you have a correct algorithm you usually can obtain faster speeds by parallelizing the task, and that's probably the best use of your time.

    That said, here are some small improvements to your code that could result in better performance:

    1) Use elsif where possible, i.e.:

    if (/wire capacitance/) {
        ...
    }
    elsif (/wire resistance/) {
        ...
    From your example file, it doesn't seem like you'll see both wire capacitance and wire resistance on the same line. Also, consider using elsif in the other chain of if statements if only one condition can be true at a time.

    2) Use: if ($TotalNets % 50_000 == 0 && $TotalNets > 0) ... to test if you want to print out the trace message. It's simpler, cleaner and faster.

Re: Looking for ways to speed up the parsing of a file...
by apl (Monsignor) on May 17, 2008 at 20:12 UTC
    You've already gotten the best suggestions, so here are some minor ones. Replace
    if (($TotalNets == 50000)   || ($TotalNets == 100000)  ||
        ($TotalNets == 250000)  || ($TotalNets == 500000)  ||
        ($TotalNets == 1000000) || ($TotalNets == 1500000) ||
        ($TotalNets == 2000000) || ($TotalNets == 3000000)) {
    with a hash (e.g. %Nets) containing the above values and
    if ( $Nets{$TotalNets} ) {
    I like pc88mxer's suggestion, but I notice that (for example) 150,000 is not in your test, so modulo 50,000 would give you a false positive.
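    For what it's worth, here is a small sketch of that lookup-hash idea (the milestone list is the OP's; everything else is illustrative):

```perl
use strict;
use warnings;

# Build the lookup hash once: each milestone maps to a true value,
# so the chain of == comparisons becomes a single hash lookup.
my %Nets = map { $_ => 1 }
           (50_000, 100_000, 250_000, 500_000,
            1_000_000, 1_500_000, 2_000_000, 3_000_000);

my $TotalNets = 250_000;    # stand-in value, for illustration only
print "Parsed $TotalNets nets...\n" if $Nets{$TotalNets};
```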

    Another suggestion would be to replace

    $_ =~ s/^ *//;
    $_ =~ s/ *$//;
    @DriverLine = split (/\s+/,$_);
    with
    @DriverLine = split( /\s+/, (( /^\s*(\S.*)\s*$/ ) ? $1 : $_ ) );

    The code hasn't been tested, and I can't swear to the efficiencies.

      Split has a magical whitespace incantation:

      $_ = ' foo bar baz '; print "'$_'\n" for split ' ';

      So you don't need to trim leading or trailing whitespace to get the desired result.

      As to the final suggestion, it might be more efficient to replace what is in effect two regexes (plus some logic) with just one:

      >perl -wMstrict -e "for (@ARGV) { my @s = m{ \S+ }xmsg; local $\" = ':'; print qq('$_' :@s: \n); }" "" " " " " " " " " "foo" " foo" "foo " " foo " "foo bar" " foo bar" "foo bar " " foo bar " "foo bar baz" " foo bar baz" "foo bar baz " " foo bar baz "
      '' ::
      ' ' ::
      ' ' ::
      ' ' ::
      ' ' ::
      'foo' :foo:
      ' foo' :foo:
      'foo ' :foo:
      ' foo ' :foo:
      'foo bar' :foo:bar:
      ' foo bar' :foo:bar:
      'foo bar ' :foo:bar:
      ' foo bar ' :foo:bar:
      'foo bar baz' :foo:bar:baz:
      ' foo bar baz' :foo:bar:baz:
      'foo bar baz ' :foo:bar:baz:
      ' foo bar baz ' :foo:bar:baz:
        Or probably even more efficient, come to think of it, just to use the default split parameters:

        >perl -wMstrict -e "for (@ARGV) { my @s = split; local $\" = ':'; print qq('$_' :@s: \n); }" "" " " " " " " " " "foo" " foo" "foo " " foo " "foo bar" " foo bar" "foo bar " " foo bar " "foo bar baz" " foo bar baz" "foo bar baz " " foo bar baz "
        '' ::
        ' ' ::
        ' ' ::
        ' ' ::
        ' ' ::
        'foo' :foo:
        ' foo' :foo:
        'foo ' :foo:
        ' foo ' :foo:
        'foo bar' :foo:bar:
        ' foo bar' :foo:bar:
        'foo bar ' :foo:bar:
        ' foo bar ' :foo:bar:
        'foo bar baz' :foo:bar:baz:
        ' foo bar baz' :foo:bar:baz:
        'foo bar baz ' :foo:bar:baz:
        ' foo bar baz ' :foo:bar:baz:
Re: Looking for ways to speed up the parsing of a file...
by graff (Chancellor) on May 18, 2008 at 01:13 UTC
    I'm wondering if 10 minutes to process 6 GB of data might not be all that bad -- maybe there's not all that much room for improvement. Have you checked how long it takes to run a script like this on that sort of data file?
    while (<>) { @_ = split; $wc += @_ }
    print "counted $wc white-space-delimited tokens\n";
    That would show you the limit on how quickly the file could be processed using perl.

    In any case, I'd be more inclined to look for ways to economize on the amount of code being written to accomplish the task. One thing that might simplify the logic a lot would be to do record-oriented input (rather than line-oriented input):

    open (NETSTATS,"$input_file");
    $/ = "\nnet '";
    my @field_names = ( 'wire capacitance', 'wire resistance',
                        'number of loads', 'number of pins',
                        'total wire length' );
    while (<NETSTATS>) {
        chomp;  # removes $/ from end of string;
        warn "got record # $.\n" if ( $. % 50000 == 0 );  # this belongs on STDERR, IMO
        next unless (/^([^']+)'/);
        $NetName = $1;
        my %field_val = ();
        for my $field ( @field_names ) {
            if ( /$field:\s+([\d.]+)/ ) {
                $field_val{$field} = $1;
            }
        }
        # ....
    }
    (updated the "next unless" line, and added assignment to $NetName, as per OP)

    I'm not going to try reimplementing the whole thing, but just that little snippet should give you the basic idea of how I would go about it. The part I've shown replaces approximately the first 50 lines of code from the OP. As for the rest, instead of testing a bunch of distinct scalar variables in order to determine what to do with the record data, you are instead checking the keys and values of a hash, which can be done with less code.
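    For example, the OP's "did I find all the fields?" test (the $c == 6 check) could collapse to a single comparison against the hash. An untested sketch, reusing the @field_names and %field_val names from the snippet above:

```perl
use strict;
use warnings;

# Sample data standing in for one parsed record (values from the OP's example).
my @field_names = ('wire capacitance', 'wire resistance',
                   'number of loads',  'number of pins',
                   'total wire length');
my %field_val = ('wire capacitance' => 0.00103955,
                 'wire resistance'  => 0.0663061,
                 'number of loads'  => 2);          # deliberately incomplete

# The record is complete when every expected field was captured.
my $complete = (grep { defined $field_val{$_} } @field_names) == @field_names;
print $complete ? "record complete\n" : "record incomplete\n";
```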

    Even if it ends up running a little slower than the original (though I doubt it would), there are other advantages in terms of clarity and maintainability of the code.

    And with this sort of approach, it might be easier to find tricks that will speed it up -- e.g. the regex matches in the for loop might be quicker if done like this (because with each iteration, $_ becomes shorter, and the target string is near the beginning):

    for my $field ( @field_names ) {
        if ( s/.*?\s$field:\s+([\d.]+)//s ) {
            $field_val{$field} = $1;
        }
    }
    The main point is that by reading the data one whole record at a time, the logic becomes a lot easier (and might end up running faster, as well).
      Hi, All,

      Thanks for all of the great suggestions. I have reduced the processing time by a whopping 40% after rolling in a number of suggestions. The new result is below.

      I don't know how to parallel process something like this. I think splitting the file will be too time-consuming. (I am running on a system with 32 CPUs, though, so it is tempting.) I think the file sizes will have to be much bigger before I consider that.

      Again, thanks everyone!!

      Fiddler42

      open (NETSTATS,"$input_file");
      $TotalNets = 0;
      while (<NETSTATS>) {
          if (/^net \'(.*)'\:\s*$/) {
              $NetName = $1;
              $c = 1;
              $TotalNets++;
              if ($TotalNets % 100_000 == 0 && $TotalNets > 0) {
                  print ("Parsed $TotalNets nets...\n");
              }
              do {
                  if (/^\s+wire capacitance\:\s+(\d+.*\d*)\s*$/) {
                      $NetCapRaw = $1;
                      $NetCap = $CapMultiplier*$NetCapRaw;
                      $c++;
                  }
                  elsif (/^\s+wire resistance\:\s+(\d+.*\d*)\s*$/) {
                      $NetRes = $1;
                      $c++;
                  }
                  elsif (/^\s+number of loads\:\s+(\d+)\s*$/) {
                      $NetFanout = $1;
                      $c++;
                  }
                  elsif (/^\s+total wire length\:\s+(\d+.*\d*)\s*/) {
                      $NetLength = $1;
                      $c++;
                  }
                  $_ = <NETSTATS>;
              } until ((/Driver Pins/) || ($_ eq "" ));
              if (/Driver Pins/) {
                  $_ = <NETSTATS>;
                  $_ = <NETSTATS>;
                  ($FirstDriver) = $_ =~ /^\s*(\S.*)\s*/;
                  $c++;
              }
              $AddToCustomTable = 0;
              if (($NetName ne "") &&
                  (($NetCap ne "") && ($NetCap ne "NaN")) &&
                  (($NetRes ne "") && ($NetRes ne "NaN")) &&
                  (($NetFanout ne "") && ($NetFanout ne "NaN")) &&
                  (($NetLength ne "") && ($NetLength ne "NaN")) &&
                  ($FirstDriver ne "") && ($c == 6)) {
                  if ($NetFanout <= $UpperFanoutLimitOfTable) {
                      if (($UseNetPattern == 0) && ($UseDriverCell == 0) && ($TopLevelOnly == 0)) {
                          $AddToCustomTable = 1;
                      }
                      elsif (($UseNetPattern == 0) && ($UseDriverCell == 0) && ($TopLevelOnly == 1)) {
                          $DriverForwardSlashCount = $FirstDriver =~ s/(\/)/$1/gs; # Simple command to count characters...
                          $NetNameForwardSlashCount = $NetName =~ s/(\/)/$1/gs;
                          if (($DriverForwardSlashCount <= 1) && ($NetNameForwardSlashCount <= 1)) {$AddToCustomTable = 1;}
                          if ($DebugMode == 1) {
                              print ("Adding net $NetName (driver = $FirstDriver)...\n");
                              print DEBUG_VERBOSE ("$NetFanout $NetRes\n");
                          }
                      }
                      elsif (($UseNetPattern == 0) && ($UseDriverCell == 1) && ($TopLevelOnly == 0)) {
                          if ($FirstDriver =~ qr/$DriverPattern/x) {$AddToCustomTable = 1;} # to regard variable as a regular expression...
                      }
                      elsif (($UseNetPattern == 1) && ($UseDriverCell == 0) && ($TopLevelOnly == 0)) {
                          if ($NetName =~ qr/$NetPattern/x) {$AddToCustomTable = 1;}
                      }
                      elsif (($UseNetPattern == 1) && ($UseDriverCell == 1) && ($TopLevelOnly == 0)) {
                          if ($NetName =~ qr/$NetPattern/x) {
                              $AddToCustomTable = 1;
                          } elsif ($FirstDriver =~ qr/$DriverPattern/x) {$AddToCustomTable = 1;}
                      }
                      # These conditions are not allowed per input argument parsing...
                      #} elsif (($UseNetPattern == 0) && ($UseDriverCell == 1) && ($TopLevelOnly == 1)) {
                      #} elsif (($UseNetPattern == 1) && ($UseDriverCell == 0) && ($TopLevelOnly == 1)) {
                      #} elsif (($UseNetPattern == 1) && ($UseDriverCell == 1) && ($TopLevelOnly == 1)) {
                  }
                  if ($AddToCustomTable == 1) {
                      push @{$NetStats[ $NetFanout ] ||= []}, [ $NetName, $NetCap, $NetRes, $NetLength, $FirstDriver ];
                  }
              }
              else {
                  if ($DebugMode == 1) {
                      print DEBUG_VERBOSE ("ERROR: Problem deriving stats for net $NetName!\n");
                      print DEBUG_VERBOSE ("ERROR: c=$c NetName=$NetName NetFanout=$NetFanout NetCap=$NetCap NetRes=$NetRes NetLength=$NetLength FirstDriver=$FirstDriver\n\n");
                  }
              }
          }
          $NetName = "";
          $NetCap = "NaN";
          $NetRes = "NaN";
          $NetFanout = "NaN";
          $NetLength = "NaN";
          $FirstDriver = "";
      }
      print ("Parsed $TotalNets nets...\n\n");
      close (NETSTATS);
        32 CPUs, wow! If you're interested, I think splitting the file should be very easy. To split it N ways I would:

        1. $start = 0
        2. seek() forward int($size / $N).
        3. Search forward for the next "^net" delimiter line, capturing start position in $here.
        4. Write out the chunk from $start to $here to its own file ("chunk.1", "chunk.2", ...).
        5. Set $start = $here
        6. Loop to 2 until done.

        Whether this is an overall win (it will take time) depends a lot on your disks. You'll have to tune $N to match the number of CPUs you can keep fed with data - set it too high and you'll go slower as your CPUs compete for disk access and slow each other down.
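        An untested sketch of those six steps (the sub name and "chunk.N" filenames are invented, and each chunk is slurped into memory for brevity; a real run would copy in buffered blocks):

```perl
use strict;
use warnings;

# Hypothetical helper implementing the seek-and-scan recipe above.
sub split_netstats {
    my ($path, $n) = @_;
    my $size = -s $path;
    open my $in, '<', $path or die "open $path: $!";

    my @chunks;
    my $start = 0;
    for my $i (1 .. $n) {
        my $end = $size;                            # default: run to EOF
        if ($i < $n) {
            seek $in, $start + int($size / $n), 0;  # jump near the target cut
            <$in>;                                  # finish the partial line we landed in
            while (defined(my $line = <$in>)) {
                if ($line =~ /^net '/) {            # next record boundary
                    $end = tell($in) - length($line);
                    last;
                }
            }
        }
        # copy bytes $start .. $end into this chunk's file
        seek $in, $start, 0;
        read $in, my $buf, $end - $start;
        open my $out, '>', "chunk.$i" or die "open chunk.$i: $!";
        print $out $buf;
        close $out;
        push @chunks, "chunk.$i";
        $start = $end;
        last if $start >= $size;
    }
    return @chunks;
}
```

        Each chunk after the first then starts exactly at a "net '..." line, so every worker can run the unmodified parser on its own chunk.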

        -sam

Re: Looking for ways to speed up the parsing of a file...
by starbolin (Hermit) on May 18, 2008 at 00:47 UTC

    This:

    if (($TotalNets == 50000)   || ($TotalNets == 100000)  ||
        ($TotalNets == 250000)  || ($TotalNets == 500000)  ||
        ($TotalNets == 1000000) || ($TotalNets == 1500000) ||
        ($TotalNets == 2000000) || ($TotalNets == 3000000)) {
    should be done in parallel, i.e., writing the current net to a fifo or shared memory, then displaying the totals with another process. Inside the read loop only do those tasks specifically necessary to processing the net records. Alternatively, read the file N lines at a time:
    do {
        for (0..N) {
            if ( my $line = <FH>) {
                ... do stuff here ...
            }
            else {
                last;
            }
        }
        print "$Some_Total";
    } until (eof);


    You're processing every token three times here:

    if ($_ =~ /wire capacitance/) {
        if ($_ =~ /^\s+wire capacitance\:\s+\d.*\d\s*$/) {
            ($NetCapRaw) = $_ =~ /^\s+wire capacitance\:\s+(\d.*\d)\s*$/;
    Replace the token on the first pass or capture the remainder of the string and pass it to another regex.


    Actually I like the idea of tokenizing the whole file in a multi-pass interpreter: tokenize the file first, replacing each token with a code-ref and each constant with an object that returns a constant, then execute the resulting file.
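    One way to read that idea (hedged, since I haven't benchmarked it) is a dispatch table mapping each field name to a code-ref, so a single regex plus one hash lookup replaces the chain of ifs. The %handlers and %stats names here are invented:

```perl
use strict;
use warnings;

my %stats;

# One code-ref per field of interest; unknown fields fall through.
my %handlers = (
    'wire capacitance'  => sub { $stats{cap}    = $_[0] },
    'wire resistance'   => sub { $stats{res}    = $_[0] },
    'number of loads'   => sub { $stats{fanout} = $_[0] },
    'total wire length' => sub { $stats{length} = $_[0] },
);

# In-memory sample standing in for the real file.
my $sample = <<'END';
  wire capacitance: 0.00103955
  wire resistance: 0.0663061
  number of loads: 2
  total wire length: 9.20
END

open my $fh, '<', \$sample or die "in-memory open: $!";
while (my $line = <$fh>) {
    next unless $line =~ /^\s*([a-z ]+):\s*([\d.]+)/;
    my $handler = $handlers{$1} or next;   # one lookup dispatches the line
    $handler->($2);
}
print "fanout=$stats{fanout} res=$stats{res}\n";
```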


    What does this do?

    if (($DriverForwardSlashCount == 0) && ($NetNameForwardSlashCount == 0)) {
        $AddToCustomTable = 1;
    There are four copies of this and they all just set $AddToCustomTable to the same value. Isn't the following the same thing?
    $AddToCustomTable = 1
        if ($DriverForwardSlashCount <= 1 && $NetNameForwardSlashCount <= 1);


    There are two time eaters in the code: reading the file and executing the regexes. I would try to separate those. Read the file in and split the fields, generating a hash of tokens and data (note this is similar to the parsing idea above), then process the hash for your data. This would seem like extra work, but often when you refactor the code like this you see optimizations you wouldn't see with the code all in one mashup like it is.


    s//----->\t/;$~="JAPH";s//\r<$~~/;{s|~$~-|-~$~|||s |-$~~|$~~-|||s,<$~~,<~$~,,s,~$~>,$~~>,, $|=1,select$,,$,,$,,1e-1;print;redo}
Re: Looking for ways to speed up the parsing of a file...
by wfsp (Abbot) on May 18, 2008 at 11:21 UTC
    Just in case you are looking for a way to slow it down... :-)

    I find that breaking this sort of job down into smaller tasks can offer benefits that out-weigh any speed penalties.

    For instance you could consider doing the 'extracting' and the 'reporting' separately. First load each 'record' into a hash and then worry about what to do with it.

    Long-winded loops, complex ifs and elses, and tricky 'bulk' regexes often lead, imo, to code that is difficult to write, read and maintain. And if the spec changes...
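    As a rough, untested sketch of that extract-then-report split (borrowing the $/ = "\nnet '" trick from graff's reply; the field names are from the OP's format, everything else is invented):

```perl
use strict;
use warnings;

# In-memory sample standing in for the 6 GB file.
my $sample = <<'END';
net 'top':
    wire capacitance: 0.001
    number of loads: 2
net 'u1/n1':
    wire capacitance: 0.002
    number of loads: 5
END

# Phase one: extract every record into a hash, no reporting logic here.
my @records;
{
    open my $fh, '<', \$sample or die "in-memory open: $!";
    local $/ = "\nnet '";                 # record-oriented input
    while (my $rec = <$fh>) {
        chomp $rec;                       # strip the trailing separator
        $rec =~ s/^net '//;               # the first record keeps its prefix
        my %r;
        ($r{name}) = $rec =~ /^([^']+)'/;
        while ($rec =~ /^\s*([a-z ]+):\s*([\d.]+)/mg) {
            $r{$1} = $2;                  # field name => value
        }
        push @records, \%r;
    }
}

# Phase two: report (here, just the higher-fanout nets).
for my $r (grep { ($_->{'number of loads'} // 0) > 2 } @records) {
    print "$r->{name}: $r->{'number of loads'} loads\n";
}
```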

    Building on the suggestions already made, this does the parsing.

    There is no error checking or validation but I believe it would be easier to do that with this approach rather than with the "one big loop" method.

    update
    Redundant line at the beginning of process_part_two sub commented out.