zrajm has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a module to (I hope) efficiently parse large logfiles with a variable entry length (most of the log entries are more than one line, some several hundred lines long). With each log entry I also need to extract the time/date.

As I go about it right now, I have a regular expression matching the beginning of each log entry, with parenthesized subexpressions capturing the year, day, month, hour, minute and second of the entry.

I load a chunk of the logfile in question into $_ using sysread. Then I split this into log entries and their subexpressions, and put it into an array local to my module. The array looks like this:

[ [ "LOGENTRY 1", $1, $2, $3, $4..], [ "LOGENTRY 2", $1, $2, $3, $4..], ... ]

Each subsequent call to my module's next_entry() simply shifts the top element off the array and returns it, until there's only one element left -- then I add another chunk from the logfile, split that into an array-of-arrays, and start the whole circus over again.
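A minimal, runnable sketch of that scheme (the names next_entry/_refill, the chunk size, and the toy timestamp format are mine; I use read on an in-memory handle where the real module uses sysread on a file):

```perl
use strict;
use warnings;

my @buffer;                       # parsed entries waiting to be handed out
my $log = "2007-12-14 one\n2007-12-15 two\n";
open my $fh, '<', \$log or die;   # in-memory handle stands in for the logfile

# Hand out one entry; top up the buffer before it runs dry.
sub next_entry {
    _refill() if @buffer <= 1;
    return shift @buffer;         # [ entry-text, year, month, day ]
}

# Read another chunk and split it into [ entry, captures... ] records.
sub _refill {
    my $chunk;
    read($fh, $chunk, 64 * 1024) or return;
    while ($chunk =~ /^(\d{4})-(\d\d)-(\d\d) (\w+)$/mg) {
        push @buffer, [ "$1-$2-$3 $4", $1, $2, $3 ];
    }
}

print next_entry()->[0], "\n";    # 2007-12-14 one
```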

However, I'm having some problems in the logfile chunk parsing. Right now, it looks like this:

while (/$regex/g) {
    push @buffer, [
        substr($_, $last, $-[0]-$last),
        $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12,
    ];
    $last = $-[0];
}

At first I tried

while (@x = /($regex)/g) { push @buffer, [ @x ]; }

...but to my surprise that didn't work. It turns out that m//g in list context returns a single list of all subexpressions from all matches, rather than (as m// without the /g flag does) one list of subexpressions per match.
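A quick demonstration of the difference, on toy data of my own:

```perl
use strict;
use warnings;

my $s = "a1 b2 c3";

# List context with /g: one flat list of ALL captures from ALL matches.
my @all = $s =~ /([a-z])(\d)/g;
print "@all\n";                  # a 1 b 2 c 3

# Scalar context with /g: one match per iteration, resuming at pos($s),
# with $1 and $2 set for just that match.
while ($s =~ /([a-z])(\d)/g) {
    print "[$1,$2] ";            # [a,1] [b,2] [c,3]
}
print "\n";
```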

How should I go about this? Perhaps there's some better method that I have completely missed?

P.S. Note that my regex does not match an *entire* log entry, just the beginning of it.

Replies are listed 'Best First'.
Re: Efficient log parsing?
by BrowserUk (Patriarch) on Dec 14, 2007 at 07:38 UTC

    You can use substr to constrain the part of the string searched (without replication). And by using the offset of the end of the last capture group you can move the window along the string efficiently:

    $s = 'abc' x 10;    ## repetitive test data
    $p = 0;             ## start at offset 0

    ## match against the substring starting at the offset
    ## and capture (3) items
    while( my @x = substr( $s, $p ) =~ m[(.)(.)(.)] ) {
        print "@x";     ## do something with the captures
        ## and advance the offset to the end of what was matched
        $p += $+[3];
    }

    a b c a b c a b c a b c a b c a b c a b c a b c a b c a b c

    When the while loop exits, the remainder beyond the offset could contain a partial match, so delete the front of the string and append the next read to the end.
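That carry-over step might look like this (the buffer contents and offsets are made up; the appended string stands in for the next sysread):

```perl
use strict;
use warnings;

my $s = "abcabcab";          # working buffer; trailing "ab" may be a partial record
my $p = 6;                   # offset of the first byte not yet consumed

substr($s, 0, $p) = '';      # delete the consumed front of the string
$p = 0;                      # the window starts over at the new front

$s .= "cabc";                # append the next chunk (i.e. the next read)
print "$s\n";                # abcabc -- the partial record is now complete
```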


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Yes!

      It was something like this I was looking for. Only, since my regex gets passed in from the user (and I therefore cannot know the number of parenthesized subexpressions used), I need to advance my offset with $+[0] instead.
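      Right: $-[0] and $+[0] are the offsets of the whole match, so they work no matter how many groups the user's regex contains. A quick illustration on a sample string of my own:

```perl
use strict;
use warnings;

# @- and @+ hold start/end offsets of the last successful match:
# index 0 is the whole match, index N is capture group N.
"foo 2007-12-14 bar" =~ /(\d{4})-(\d\d)-(\d\d)/ or die;
print "$-[0] $+[0]\n";       # 4 14  (whole match)
print "$-[1] $+[1]\n";       # 4 8   (group 1, "2007")
```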

      Could someone possibly refer me to some point in the docs that states that this kind of use of substr really is efficient? (I know I too have read that somewhere, sometime long ago -- but where?)

        Surprised again by what is efficient/inefficient in perl.

        Turns out that trying to keep track of submatches simply isn't worth it. If I split my log using a regex matching the head of each log entry (with the above-mentioned parenthesized subexpressions whose contents I want to keep), it is many times faster to apply the regex again to each individual log entry to get the list of matching subexpressions than it is to store them in an array-of-arrays and pass them around for later reference.

        My guess is that tossing references to lists around is rather expensive, while matching a regex against a small text (a single log entry), where you know it will match at the first character position, is cheap.

        While this double-regex matching seems redundant, it has the benefit of making the program both fast and easy to read. I'll get back to y'all with the code soon enough. :)
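        That two-pass idea, sketched with a made-up log format (in the real module the head regex comes from the user):

```perl
use strict;
use warnings;

my $log  = "12:00 first entry\nand a second line\n12:05 second entry\n";
my $head = qr/^(\d\d):(\d\d) /m;

# Pass 1: split into whole entries at each head, without capturing yet.
# (A zero-width match at the start of the string produces no empty field.)
my @entries = split /^(?=\d\d:\d\d )/m, $log;

# Pass 2: re-apply the head regex to each small entry for its captures.
for my $e (@entries) {
    my ($h, $m) = $e =~ $head;
    print "$h:$m -> $e";
}
```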

Re: Efficient log parsing?
by steved (Initiate) on Dec 14, 2007 at 05:00 UTC
    By using @x in @x = /($regex)/g you're telling Perl to return a list from the regex. Change it to $x and the match will return a scalar.
      Ahh, but I'm also looking to obtain any subpatterns (parenthesized expressions) inside my regex. Putting @x there was meant to capture them.

      I naïvely thought m//g would return the same kind of stuff as m// without the /g flag, so that I could iterate over the string, obtaining one array of subpatterns at a time.

      But m// resets pos() and always begins at the start of the string, so that won't do either.

      Is there some way I could get m// to start searching where the last match left off? Using substr()? How efficient is that? Does it make a new copy of the string, or operate directly on the original?
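      For what it's worth, m//g in scalar context already does this: it resumes at pos(), and \G anchors a match exactly there (the toy string below is mine):

```perl
use strict;
use warnings;

my $s = "aa bb cc";
$s =~ /aa/g;                 # scalar-context /g sets pos($s)
print pos($s), "\n";         # 2

# \G anchors the next match exactly at pos(); /c keeps pos() on failure.
if ($s =~ /\G (\w\w)/gc) {
    print "$1\n";            # bb
}
print pos($s), "\n";         # 5
```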

Re: Efficient log parsing?
by NetWallah (Canon) on Dec 14, 2007 at 05:54 UTC
    Add the "c" modifier if you want an array extraction without a position reset:
    "If you don't want the position reset after failure to match, add the //c, as in /regexp/gc."
    So your code should read:
    while (@x = /($regex)/gc) {
        push @buffer, [ @x ];
        # Or, if you localize (my) @x:
        # push @buffer, +\@x;
    }

         "As you get older three things happen. The first is your memory goes, and I can't remember the other two... " - Sir Norman Wisdom

      No. That won't work. The loop will only be executed once because of the /g flag, which causes all matches to be returned at once.

      Sorry. :(

        Good catch.

        OK - here is a somewhat contrived variation that seems to work for me:

        use strict;
        $_ = "this that other thing that thw tests sometimes";
        my $regex = qr/(t..)\w+\s+?(\w\w)(.+)/;
        my @buffer;
        my @x;
        while ( (@x[0..1], $_) = /$regex/ ) {
            push @buffer, [ @x ];
            print join(",", @x) . ";\n";
        }
        Changes:
        • You need to know how many elements to expect
        • The regex has been modified to return all remaining information as the final element
        • The final element gets re-stored into $_
        • No more /g modifier

             "As you get older three things happen. The first is your memory goes, and I can't remember the other two... " - Sir Norman Wisdom

Re: Efficient log parsing?
by lihao (Monk) on Dec 14, 2007 at 06:08 UTC

    Can you reset the input record separator $/? That might be much easier than using sysread, which might break up your records. If you do need that AoA, you can maintain a queue of fixed or limited size.
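    Reading by records with $/ might look like this (the separator string is invented; note that $/ must be a literal string, not a pattern, so this only helps when entries have a fixed separator):

```perl
use strict;
use warnings;

my $data = "first record\n--\nsecond record\n--\n";
open my $fh, '<', \$data or die;

{
    local $/ = "--\n";           # one <$fh> read now returns one record
    while (my $rec = <$fh>) {
        chomp $rec;              # chomp strips the whole separator
        print "got: $rec";
    }
}
```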

    BTW, be very cautious when using regexes on large chunks of arbitrary text.

    Regards

    Lihao(XC)