in reply to Efficient log parsing?

You can use substr to constrain the part of the string that is searched (without replicating the string). And by using the offset of the end of the last capture group, you can move the window along the string efficiently:

    $s = 'abc' x 10;  ## repetitive test data
    $p = 0;           ## start at offset 0

    ## match against the substring starting at the offset
    ## and capture (3) items
    while( my @x = substr( $s, $p ) =~ m[(.)(.)(.)] ) {
        print "@x ";  ## do something with the captures
        ## and advance the offset to the end of what was matched
        $p += $+[3];
    }

Output:

    a b c a b c a b c a b c a b c a b c a b c a b c a b c a b c

When the while loop exits, the remainder beyond the offset could contain a partial match, so delete the front of the string and append the next read to the end.
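A minimal sketch of that read-and-carry loop, assuming a hypothetical key=value; record format and hand-fed chunks (a real parser would read the chunks from the log file):

```perl
#!/usr/bin/perl
use strict;
use warnings;

## Sketch of the read-and-carry loop described above. The record
## pattern (key=value;) and the chunk boundaries are invented for
## illustration.
my $buf = '';
my @out;

sub process_chunk {
    my( $chunk ) = @_;
    $buf .= $chunk;            ## append the next read to the end
    my $p = 0;
    while( my @x = substr( $buf, $p ) =~ /(\w+)=(\w+);/ ) {
        push @out, "@x";       ## use the captures
        $p += $+[0];           ## end of the whole match, relative to $p
    }
    ## delete the front of the string (everything already matched);
    ## a partial record at the end stays and the next read completes it
    substr( $buf, 0, $p ) = '';
}

process_chunk( 'a=1;b=2;c' );  ## ends mid-record
process_chunk( '=3;' );        ## next read completes 'c=3;'
## @out is now ( 'a 1', 'b 2', 'c 3' )
```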



Re^2: Efficient log parsing?
by zrajm (Beadle) on Dec 16, 2007 at 05:12 UTC
    Yes!

    It was something like this I was looking for. Only, since my regex gets passed in from the user (and I therefore cannot know the number of parenthesized subexpressions used), I need to advance my offset with $+[0];.
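    A short sketch of that variant: $+[0] is the offset of the end of the whole match, so it works no matter how many capture groups the user's pattern contains. The pattern and data below are invented for illustration:

```perl
use strict;
use warnings;

## Advance by $+[0] (end of the whole match) so the number of
## parenthesized groups in the user-supplied pattern doesn't matter.
my $user_re = qr/(\d+):(\w+)\s*/;    ## pretend this came from the user
my $s = '1:foo 2:bar 3:baz ';
my $p = 0;
my @entries;
while( my @x = substr( $s, $p ) =~ $user_re ) {
    push @entries, [ @x ];
    $p += $+[0];   ## works for any number of capture groups
}
## @entries is now ( [1,'foo'], [2,'bar'], [3,'baz'] )
```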

    Could someone possibly refer me to some point in the docs that states that this kind of use of substr really is efficient? (I know I too have read that somewhere, sometime long ago -- but where?)

      Surprised again by what is efficient/inefficient in perl.

      Turns out that trying to keep track of submatches simply isn't worth it. If I split my log using a regex (with the above-mentioned parenthesized subexpressions, whose content I wish to retain and use) to match the head of each log entry, it is many times faster to apply the regex again to each individual log entry to get the list of matching subexpressions than it is to store them in an array-of-arrays and pass them around for later reference.

      My guess is that it is rather expensive to pass list references around, while matching a regex against a small text (a single log entry), where you know it is going to match at the first character position, is cheap.

      While this double-regex matching seems redundant, it has the benefit of making the program both fast and easy to read. I'll get back to ya'all with the code soon enough. :)
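      A sketch of what that double-regex approach might look like, assuming a hypothetical timestamped log format (both patterns and the sample log are invented here, not the poster's actual code):

```perl
use strict;
use warnings;

## "Match twice": split the log on the entry-head pattern, then
## re-apply the capturing pattern to each entry to get the captures.
my $head    = qr/^(\d{4}-\d{2}-\d{2}) (\w+): /m;   ## capturing
my $head_nc = qr/^\d{4}-\d{2}-\d{2} \w+: /m;       ## same, no captures

my $log = <<'END';
2007-12-16 INFO: started
2007-12-16 WARN: low disk
continued on a second line
2007-12-17 INFO: done
END

## split on a zero-width lookahead so each entry keeps its head line;
## the non-capturing variant keeps split from emitting the captures
## as extra list elements
my @entries = grep { length } split /(?=$head_nc)/, $log;

my @parsed;
for my $entry ( @entries ) {
    ## the second, "redundant" match is cheap: it is known to
    ## succeed at the very first character of the entry
    my( $date, $level ) = $entry =~ $head;
    push @parsed, "$date $level";
}
## @parsed is now ( '2007-12-16 INFO', '2007-12-16 WARN', '2007-12-17 INFO' )
```

      The split keeps multi-line entries intact (the continuation line stays attached to its head), so each entry can be handled as one unit.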