I'm wondering if 10 minutes to process 6 GB of data might not be all that bad -- maybe there isn't much room for improvement. Have you checked how long it takes to run a script like this on that sort of data file?
while (<>) {
    @_ = split;
    $wc += @_;
}
print "counted $wc white-space-delimited tokens\n";
That would show you the limit on how quickly the file could be processed using perl.

In any case, I'd be more inclined to look for ways to economize on the amount of code being written to accomplish the task. One thing that might simplify the logic a lot would be to do record-oriented input (rather than line-oriented input):

open( NETSTATS, $input_file ) or die "can't open $input_file: $!";
$/ = "\nnet '";   # record separator: read one "net" stanza at a time
my @field_names = (
    'wire capacitance',
    'wire resistance',
    'number of loads',
    'number of pins',
    'total wire length',
);
while (<NETSTATS>) {
    chomp;   # removes $/ from end of string
    warn "got record # $.\n" if ( $. % 50000 == 0 );   # this belongs on STDERR, IMO
    next unless (/^([^']+)'/);
    $NetName = $1;
    my %field_val = ();
    for my $field ( @field_names ) {
        if ( /$field:\s+([\d.]+)/ ) {
            $field_val{$field} = $1;
        }
    }
    # ....
}
(updated the "next unless" line, and added assignment to $NetName, as per OP)

I'm not going to try reimplementing the whole thing, but that little snippet should give you the basic idea of how I would go about it. The part I've shown replaces approximately the first 50 lines of code from the OP. As for the rest: instead of testing a bunch of distinct scalar variables to decide what to do with the record data, you check the keys and values of a hash, which takes less code.
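For instance, the downstream bookkeeping might loop over the hash keys rather than naming each scalar explicitly. This is just a sketch with made-up field values (the %totals accumulator is my invention, not from the OP's code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record data, as it would come out of the parsing loop above.
my %field_val = (
    'wire capacitance' => 0.125,
    'number of loads'  => 4,
);

# Accumulate per-field running totals across records -- one loop,
# no per-field scalar variables to test.
my %totals;
for my $field ( keys %field_val ) {
    $totals{$field} += $field_val{$field};
}

printf "%s: %s\n", $_, $totals{$_} for sort keys %totals;
```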

Even if it ends up running a little slower than the original (though I doubt it would), there are other advantages in terms of clarity and maintainability of the code.

And with this sort of approach, it might be easier to find tricks that will speed it up -- e.g. the regex matches in the for loop might be quicker if done like this (because with each iteration, $_ becomes shorter, and the target string is near the beginning):

for my $field ( @field_names ) {
    if ( s/.*?\s$field:\s+([\d.]+)//s ) {
        $field_val{$field} = $1;
    }
}
The main point is that by reading the data one whole record at a time, the logic becomes a lot easier (and might end up running faster, as well).
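To see the record-reading trick in isolation: here's a toy run against an in-memory filehandle (the data string is made up, not from the OP's file):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pretend this string is the netlist file.
my $data = "header line\nnet 'n1' wire capacitance: 1.5\nnet 'n2' wire capacitance: 2.0\n";

open my $fh, '<', \$data or die "can't open in-memory handle: $!";
local $/ = "\nnet '";   # each read returns one whole "net" stanza
my @records = <$fh>;
chomp @records;         # chomp now strips the "\nnet '" terminator, not "\n"

# $records[0] is the stuff before the first net; each later record
# begins with the net name, ready for the /^([^']+)'/ match.
print scalar(@records), " records\n";
```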

In reply to Re: Looking for ways to speed up the parsing of a file... by graff
in thread Looking for ways to speed up the parsing of a file... by fiddler42
