in reply to Re: Looking for ways to speed up the parsing of a file...
in thread Looking for ways to speed up the parsing of a file...

Hi, All,

Thanks for all of the great suggestions. I have cut the processing time by a whopping 40% after rolling in a number of them. The new code is below.

I don't know how to parallelize something like this, and I suspect splitting the file would be too time-consuming. (I am running on a system with 32 CPUs, though, so it is tempting.) I think the file sizes will have to get much bigger before I consider that.

Again, thanks everyone!!

Fiddler42

# Parse the net statistics report one record at a time.
open(NETSTATS, '<', $input_file) or die "Can't open $input_file: $!";
$TotalNets = 0;

while (<NETSTATS>) {
    if (/^net '(.*)':\s*$/) {
        $NetName = $1;
        $c       = 1;
        $TotalNets++;
        print("Parsed $TotalNets nets...\n") if $TotalNets % 100_000 == 0;

        # Collect this net's stats until the "Driver Pins" section (or EOF).
        do {
            if (/^\s+wire capacitance:\s+(\d+.*\d*)\s*$/) {
                $NetCapRaw = $1;
                $NetCap    = $CapMultiplier * $NetCapRaw;
                $c++;
            }
            elsif (/^\s+wire resistance:\s+(\d+.*\d*)\s*$/)  { $NetRes    = $1; $c++; }
            elsif (/^\s+number of loads:\s+(\d+)\s*$/)       { $NetFanout = $1; $c++; }
            elsif (/^\s+total wire length:\s+(\d+.*\d*)\s*/) { $NetLength = $1; $c++; }
            $_ = <NETSTATS>;
        } until (!defined($_) || /Driver Pins/);

        if (defined($_) && /Driver Pins/) {
            $_ = <NETSTATS>;    # skip the separator line
            $_ = <NETSTATS>;    # first driver pin line
            ($FirstDriver) = $_ =~ /^\s*(\S.*)\s*/;
            $c++;
        }

        $AddToCustomTable = 0;
        if (   $NetName ne ""
            && $NetCap ne ""    && $NetCap ne "NaN"
            && $NetRes ne ""    && $NetRes ne "NaN"
            && $NetFanout ne "" && $NetFanout ne "NaN"
            && $NetLength ne "" && $NetLength ne "NaN"
            && $FirstDriver ne ""
            && $c == 6)
        {
            if ($NetFanout <= $UpperFanoutLimitOfTable) {
                if ($UseNetPattern == 0 && $UseDriverCell == 0 && $TopLevelOnly == 0) {
                    $AddToCustomTable = 1;
                }
                elsif ($UseNetPattern == 0 && $UseDriverCell == 0 && $TopLevelOnly == 1) {
                    # Substitution in scalar context counts the forward slashes...
                    $DriverForwardSlashCount  = $FirstDriver =~ s/(\/)/$1/gs;
                    $NetNameForwardSlashCount = $NetName     =~ s/(\/)/$1/gs;
                    if ($DriverForwardSlashCount <= 1 && $NetNameForwardSlashCount <= 1) {
                        $AddToCustomTable = 1;
                    }
                    if ($DebugMode == 1) {
                        print("Adding net $NetName (driver = $FirstDriver)...\n");
                        print DEBUG_VERBOSE ("$NetFanout $NetRes\n");
                    }
                }
                elsif ($UseNetPattern == 0 && $UseDriverCell == 1 && $TopLevelOnly == 0) {
                    # qr// so the variable is treated as a regular expression...
                    $AddToCustomTable = 1 if $FirstDriver =~ qr/$DriverPattern/x;
                }
                elsif ($UseNetPattern == 1 && $UseDriverCell == 0 && $TopLevelOnly == 0) {
                    $AddToCustomTable = 1 if $NetName =~ qr/$NetPattern/x;
                }
                elsif ($UseNetPattern == 1 && $UseDriverCell == 1 && $TopLevelOnly == 0) {
                    if    ($NetName     =~ qr/$NetPattern/x)    { $AddToCustomTable = 1; }
                    elsif ($FirstDriver =~ qr/$DriverPattern/x) { $AddToCustomTable = 1; }
                }
                # The remaining combinations -- (0,1,1), (1,0,1), (1,1,1) --
                # are not allowed per input argument parsing...
            }
            if ($AddToCustomTable == 1) {
                push @{ $NetStats[$NetFanout] ||= [] },
                     [ $NetName, $NetCap, $NetRes, $NetLength, $FirstDriver ];
            }
        }
        else {
            if ($DebugMode == 1) {
                print DEBUG_VERBOSE ("ERROR: Problem deriving stats for net $NetName!\n");
                print DEBUG_VERBOSE ("ERROR: c=$c NetName=$NetName NetFanout=$NetFanout NetCap=$NetCap NetRes=$NetRes NetLength=$NetLength FirstDriver=$FirstDriver\n\n");
            }
        }
    }

    # Reset per-net state before the next record.
    $NetName     = "";
    $NetCap      = "NaN";
    $NetRes      = "NaN";
    $NetFanout   = "NaN";
    $NetLength   = "NaN";
    $FirstDriver = "";
}
print("Parsed $TotalNets nets...\n\n");
close(NETSTATS);

Re^3: Looking for ways to speed up the parsing of a file...
by samtregar (Abbot) on May 18, 2008 at 17:01 UTC
    32 CPUs, wow! If you're interested, I think splitting the file should be very easy. To split it N ways I would:

    1. $start = 0
    2. seek() forward int($size / $N).
    3. Search forward for the next "^net" delimiter line, capturing start position in $here.
    4. Write out the chunk from $start to $here to a numbered file ("chunk.1", "chunk.2", ...).
    5. Set $start = $here
    6. Loop to 2 until done. (A rough sketch of this loop follows.)
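
    Something like this is what I have in mind. It's a rough, untested sketch: it assumes each record really does start with a line matching /^net /, and the input filename and "chunk.N" output names are just placeholders:

    use strict;
    use warnings;

    my $input_file = 'netstats.rpt';   # placeholder input name
    my $N          = 8;                # number of chunks to produce
    my $size       = -s $input_file;

    open my $in, '<', $input_file or die "open $input_file: $!";

    my $start = 0;
    for my $i (0 .. $N - 1) {
        my $here = $size;                       # default: run to EOF
        if ($i < $N - 1) {
            seek $in, $start + int($size / $N), 0 or die "seek: $!";
            <$in>;                              # throw away the partial line
            while (my $line = <$in>) {          # find the next record boundary
                if ($line =~ /^net /) {
                    $here = tell($in) - length $line;
                    last;
                }
            }
        }
        seek $in, $start, 0 or die "seek: $!";  # copy $start..$here to chunk.$i
        read $in, my $buf, $here - $start;
        open my $out, '>', "chunk.$i" or die "open chunk.$i: $!";
        print $out $buf;
        close $out;
        $start = $here;
    }
    close $in;

    Note that it slurps each whole chunk into memory before writing it back out; for really huge files you'd copy line by line instead.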

    Whether this is an overall win (the splitting itself takes time) depends a lot on your disks. You'll have to tune $N to match the number of CPUs you can keep fed with data: set it too high and the processes will compete for disk access and slow each other down.

    -sam

      Instead of actually splitting the file into several additional files, you could just determine the positions as you describe, then work on the different parts by seeking to the right position before starting your processing loop. For example, you could determine the start and end position and then fork() off a new process to work on that chunk. Reading from multiple files (or different places in the same file) in parallel might end up being less efficient from an I/O perspective, though, as it could require the drive to seek a lot more. So you'd need to experiment a bit to find the right way to parallelize this.
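
      For example, here's a minimal sketch of the fork()-per-chunk idea. It's untested; the @chunks offsets are placeholders you'd fill in from the boundary scan described above, and the per-record work is elided:

      use strict;
      use warnings;

      my $input_file = 'netstats.rpt';             # placeholder input name
      my @chunks = ( [0, 12345], [12345, 67890] ); # placeholder (start, end) byte offsets

      my @pids;
      for my $chunk (@chunks) {
          my ($start, $end) = @$chunk;
          my $pid = fork();
          die "fork failed: $!" unless defined $pid;
          if ($pid == 0) {                         # child: handle one chunk
              open my $fh, '<', $input_file or die "open: $!";
              seek $fh, $start, 0 or die "seek: $!";
              while (<$fh>) {
                  last if tell($fh) > $end;        # past this chunk's boundary
                  # ... run the parsing loop from the parent node on $_ here ...
              }
              close $fh;
              exit 0;
          }
          push @pids, $pid;                        # parent: keep spawning
      }
      waitpid $_, 0 for @pids;                     # wait for all the children

      One catch: fork()ed children don't share memory with the parent, so each child has to write its results somewhere (a temp file, a pipe) for the parent to merge after the waitpid loop.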