Even if you can't load the entire log into memory, loading it in chunks should speed things up.

1. You can take the hash with the entries and concatenate a few thousand of them into one buffer (or however many fit in the memory you want to use). Or you can read a fixed-size chunk of the log data from the file and then read up to the next newline with <IN>, so that the chunk ends on a line boundary.

2. Run the search/replace for each regex over the entire buffer at once. s/foo/bar/g returns the number of substitutions it made, so you can still find out how many replacements were done for each regex.
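As a quick illustration of step 2, here is a minimal sketch of counting the substitutions done on a whole buffer. The buffer contents and the patterns are made up for the example.

```perl
use strict;
use warnings;

# Made-up sample buffer holding several log lines at once.
my $buffer = "foo one\nfoo two\nbar three\n";

# s///g returns the number of substitutions made, or a false value if none,
# so "|| 0" turns "no match" into a plain zero.
my $count = ( $buffer =~ s/foo/baz/g ) || 0;
my $none  = ( $buffer =~ s/qux/baz/g ) || 0;

print "replaced $count, then $none\n";    # prints "replaced 2, then 0"
```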


Code:

Get a chunk of data:

    my $bufflen = 4 * 1024;    # or whatever
    do {
        $result = read( IN, $buffer, $bufflen - length($buffer), length($buffer) );
    } while ( $result && ( length($buffer) < $bufflen ) );

If the chunk ends in the middle of a line, strip off the remainder and save it for the next chunk (not needed if each entry is separated beforehand):

    my $newline = "\n";    # or some other unique record separator

    $remainder = '';       # nothing to carry over once we hit eof

    ## if we're not at eof
    if ($result > 0) {
        my $last_newline = rindex $buffer, $newline;
        my $remainderlen = length($buffer) - $last_newline - length($newline);
        if ($remainderlen > 0) {
            ## 4-arg substr cuts the partial line out of $buffer and returns it
            $remainder = substr($buffer, $last_newline + length($newline), $remainderlen, '');
        }
    }

    ## this is important: once the current $buffer has been processed and
    ## written out, prefix the remainder before reading the next chunk
    $buffer = $remainder;

EDIT: Or, instead of the above, you could just use readline to finish the current line, like BrowserUk did. D'oh!

    $buffer .= <IN>;

Then apply your regexes (and count how many replacements were done):

    foreach my $regex (@conversions) {
        ## s///g returns the number of substitutions made (false if none)
        my $reps_done = ( $buffer =~ s/$regex->{from}/$regex->{to}/g ) || 0;
        $regex->{count} += $reps_done;
    }

    ## and do whatever with the result
    print OUT $buffer;

All of the above would go inside a loop that runs over each chunk until the end of the log file is reached.
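Putting the pieces together, the loop could look roughly like the sketch below. It uses the readline variant to finish each chunk on a line boundary. The sub name convert_log, the lexical filehandles (instead of IN and OUT), the in-memory sample log and the two patterns are all assumptions made for the example, not part of the original code.

```perl
use strict;
use warnings;

# Reads $in in chunks of about $bufflen bytes, extends each chunk to the end
# of the current line, applies every conversion, and writes the result to $out.
sub convert_log {
    my ( $in, $out, $conversions, $bufflen ) = @_;
    $bufflen ||= 4 * 1024;

    while (1) {
        my $buffer = '';
        my $got = read( $in, $buffer, $bufflen );
        die "read failed: $!" unless defined $got;
        last if $got == 0;    # end of file

        # Finish the partial line so no entry is split across two chunks.
        my $rest = <$in>;
        $buffer .= $rest if defined $rest;

        for my $conv (@$conversions) {
            my $n = ( $buffer =~ s/$conv->{from}/$conv->{to}/g ) || 0;
            $conv->{count} += $n;
        }
        print {$out} $buffer;
    }
    return;
}

# Usage with in-memory filehandles and hypothetical conversions.
my $log = "ERROR disk full\nWARNING low memory\nERROR disk full\n";
my $result = '';
open my $in,  '<', \$log    or die "open in: $!";
open my $out, '>', \$result or die "open out: $!";

my @conversions = (
    { from => qr/\bERROR\b/,   to => 'E', count => 0 },
    { from => qr/\bWARNING\b/, to => 'W', count => 0 },
);

convert_log( $in, $out, \@conversions, 8 );    # tiny chunk size to force splits
close $out;
print $result;
```

The chunk size of 8 is deliberately tiny so that every read stops mid-line and the readline step has something to do. For a real log you would use something much larger.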


In reply to Re: Recommendations for efficient data reduction/substitution application by ipherian
in thread Recommendations for efficient data reduction/substitution application by atcroft
