Efficient Way to Parse a Large Log File with a Large Regex

by Dru (Hermit)
on Apr 12, 2005 at 17:02 UTC ( [id://447093] )

Dru has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have an array of almost 500 IPs, and I want to see whether any of them appear in a log file. The log file is large, it sometimes gets up to 3GB. I want to run this script from cron every hour to check for these IPs, but I'm thinking this might be too much of a load on the server (dual CPU, 2GB memory, RedHat ES 3.0), so I might run it just a few times a day. I also thought about doing a tail -f logfile | <name of program>.pl to look at just the new log entries, but again I'm concerned about the server being able to keep up.

Anyway, I'm looking for suggestions on how to efficiently parse this much data. My initial plan was to build a regex that groups, but does not capture, all of the IPs, with an alternation between each one. Something along the lines of:
/(?:192\.168\.1\.1|192\.168\.2\.1)/
BTW, the IPs are not in a nice sequential order like the ones above; they are all over the place.

Actually, I still haven't figured out how I'm going to get from the array to the regex. I was thinking I could use map to build it, but I'm still a map newbie. I did backslash each dot like this:
@ips = map { quotemeta } @ips;
my $file = shift;
So I guess my questions are:

1. Is creating a regex, like the one discussed above, going to be the most efficient way?

2. If yes to number 1, any suggestions on how to build a regex from the array?

P.S. I know the term efficient can vary greatly from one programmer to the next, but I'm just looking for suggestions.

-Dru

Replies are listed 'Best First'.
Re: Efficient Way to Parse a Large Log File with a Large Regex
by hsinclai (Deacon) on Apr 12, 2005 at 17:19 UTC
    The log file is large, it sometimes gets up to 3GB

    A technique for seeking through large files is described here, and it works very well. That discussion was about replacing characters with tr, but I think you can easily adapt it to your IP-matching needs. HTH
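    Something like this rough sketch, perhaps (my adaptation, not the code from that node; the pattern and path are placeholders): read the file in big sysread chunks and keep a short overlap so an IP split across a block boundary isn't missed.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $re = qr/(?:192\.168\.1\.1|192\.168\.2\.1)/;        # placeholder pattern
    open my $fh, '<', '/var/log/fw.log' or die "open: $!"; # placeholder path

    my ( $block, $tail ) = ( '', '' );
    while ( sysread $fh, $block, 8 * 1024 * 1024 ) {       # 8MB blocks
        my $buf = $tail . $block;
        print "match\n" if $buf =~ $re;
        # carry the last 15 bytes (max length of a dotted quad) forward
        $tail = length($buf) > 15 ? substr( $buf, -15 ) : $buf;
    }
    close $fh;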

Re: Efficient Way to Parse a Large Log File with a Large Regex
by gam3 (Curate) on Apr 12, 2005 at 17:27 UTC
    You can give this a try and see just how slow it is.
    @list = map({ quotemeta "129.$_.125.123" } (0..255));
    $regex_text = join('|', @list);
    $re = qr[($regex_text)];
    print $re, "\n";
    while (<>) {
        if ($_ =~ $re) {
            print "$1\n";
        }
    }
    Update: This seems to be faster than the hash method.
    -- gam3
    A picture is worth a thousand words, but takes 200K.
      Thanks. This is what I came up with based on your code:
      use warnings;
      use strict;

      my @ips = qw/192.168.2.1 ..../;
      @ips = map { quotemeta } @ips;
      my $regex = join('|', @ips);
      my $re = qr[$regex];

      while (<>) {
          print if /$re/;
      }
      I'm then calling it like so:
      tail -f fw.log | /usr/local/scripts/parseips
      It's taking up quite a bit of resources, but not bringing the server to its knees.

      Thank you for the other suggestions also. I'm going to come up with a more permanent solution, based on one of these, that does not require me to stare at a terminal.
Re: Efficient Way to Parse a Large Log File with a Large Regex
by NateTut (Deacon) on Apr 12, 2005 at 18:03 UTC
    Process the file once, saving the offset into the file where you finished. Then, next time, seek to that position and process from there to the end of the file, storing the new end-of-file position.

    This should save you a lot of redundant processing.
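    A minimal sketch of that idea (the log path and state-file location are assumptions):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $log   = '/var/log/fw.log';      # assumed path
    my $state = '/var/tmp/fw.offset';   # assumed state file

    # read back the offset saved by the previous run, if any
    my $offset = 0;
    if ( open my $s, '<', $state ) {
        $offset = <$s> + 0;
        close $s;
    }

    open my $fh, '<', $log or die "open $log: $!";
    $offset = 0 if $offset > -s $log;   # log was rotated or truncated
    seek $fh, $offset, 0 or die "seek: $!";

    while (<$fh>) {
        # match the IP list against each new line here
    }

    # remember where we stopped, for next time
    open my $out, '>', $state or die "write $state: $!";
    print $out tell($fh);
    close $out;
    close $fh;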
Re: Efficient Way to Parse a Large Log File with a Large Regex
by samtregar (Abbot) on Apr 12, 2005 at 17:09 UTC
    I don't know how efficient they are but the log-file parsing techniques in chapter 6 of Higher Order Perl are definitely worth a look. At the very least they're guaranteed to blow your mind.

    -sam

Re: Efficient Way to Parse a Large Log File with a Large Regex
by Fletch (Bishop) on Apr 12, 2005 at 17:36 UTC

    If you just want to check for the presence of one of a group of IPs, it'd be much more efficient to build a hash of the IPs up front, then parse the IP out of each record and do an exists $wanted{ $curIP } to tell whether it's interesting or not.
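    A minimal sketch of that, assuming plain dotted quads appear somewhere on each line:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @ips    = qw( 192.168.1.1 192.168.2.1 );   # your ~500 IPs
    my %wanted = map { $_ => 1 } @ips;

    while ( my $line = <> ) {
        # pull the first dotted quad out of the record (format assumed)
        my ($curIP) = $line =~ /(\d{1,3}(?:\.\d{1,3}){3})/ or next;
        print $line if exists $wanted{$curIP};
    }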

Re: Efficient Way to Parse a Large Log File with a Large Regex
by samizdat (Vicar) on Apr 12, 2005 at 17:48 UTC
    Oh, goody, I get to be the first one to suggest something. :) See 'a fast multipattern grep' in the Panther Book (Advanced Perl Programming), p. 74.

    Your 'tail' idea is a good one, but take care not to get overrun if the log gets busy. If you can syslog a marker line into the big file, that helps as an index.
Re: Efficient Way to Parse a Large Log File with a Large Regex
by holli (Abbot) on Apr 12, 2005 at 20:07 UTC
    Doesn't this cry for a hash-lookup?
    use strict;

    my @ips = (
        "192.1.20.1",
        "192.1.20.2",
    );
    my %ips = map { $_ => 1 } @ips;

    open LOG, "<", "logfile" or die $!;
    while ( <LOG> ) {
        # match ip-address
        if ( /(([0-9]+\.)+[0-9]+)/ ) {
            if ( $ips{$1} ) {
                # do found ip stuff here
            }
            else {
                # do other stuff here
            }
        }
    }
    close LOG;


    holli, /regexed monk/

      That, and possibly a dispatch table, especially if you want specific logging or functions to happen for certain IP addresses.
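      For example (a hypothetical sketch; the IPs and handlers are made up):

      #!/usr/bin/perl
      use strict;
      use warnings;

      # map each interesting IP to the action it should trigger
      my %handler = (
          '192.1.20.1' => sub { warn "watched host seen: $_[0]" },
          '192.1.20.2' => sub { print "other watched host: $_[0]" },
      );

      while ( my $line = <> ) {
          next unless $line =~ /((?:[0-9]+\.)+[0-9]+)/;
          my $code = $handler{$1} or next;
          $code->($line);
      }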

Re: Efficient Way to Parse a Large Log File with a Large Regex
by CountZero (Bishop) on Apr 12, 2005 at 20:49 UTC
    Save your list of IPs to a database and check each log entry against this DB as soon as the entry gets written.

    If you can 'capture' the writes to the log file and pipe them to a Perl program that extracts the IPs and checks them against the database, that seems feasible.
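    A sketch of what that could look like (the database name and the watched(ip) table are assumptions; SQLite via DBI is used here just for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=ips.db', '', '',
                            { RaiseError => 1 } );
    my $sth = $dbh->prepare('SELECT 1 FROM watched WHERE ip = ?');

    while ( my $line = <> ) {   # log entries piped in
        next unless $line =~ /(\d{1,3}(?:\.\d{1,3}){3})/;
        $sth->execute($1);
        print $line if $sth->fetchrow_array;
        $sth->finish;
    }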

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      This seems pretty reasonable. Additionally, you could create a simple POE process to tail the log file rather than piping through tail. There are several examples at the POE website. Also, merlyn has an article on tailing a logfile and processing the entries on his website.
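      A minimal POE sketch of that (the filename is assumed):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use POE qw(Wheel::FollowTail);

      POE::Session->create(
          inline_states => {
              _start => sub {
                  # follow the log without an external tail process
                  $_[HEAP]{tail} = POE::Wheel::FollowTail->new(
                      Filename   => '/var/log/fw.log',
                      InputEvent => 'got_line',
                  );
              },
              got_line => sub {
                  my $line = $_[ARG0];
                  # check $line against the IP list here
              },
          },
      );
      POE::Kernel->run();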

      It's fun to read all the replies. A lot of good ideas. I don't have anything new to add, other than this pointer to a Perl snippet by Lincoln Stein for using a DBMS for httpd logging. This approach reduces the problem of parsing log files to the much cleaner one of constructing SQL queries. And, as CountZero already pointed out, you can build in some hooks for preprocessing of log records, including one that checks against your table of IP addresses. Then all you have to do is check the entries recorded with a timestamp more recent than the last check. (Incidentally, I vote for holli's hash-lookup approach.)

      the lowliest monk

Re: Efficient Way to Parse a Large Log File with a Large Regex
by Random_Walk (Prior) on Apr 12, 2005 at 22:57 UTC

    As mentioned above, use seek so you don't re-read the log. Either qr// a series of regexes into an array and loop over it, or, if you can swiftly split the IP out of the log (i.e. if it always appears in the same position on a line, you can use unpack to extract it), something like this may help. Even more so if your desired IPs cluster a bit in the class A octet.

    #!/usr/bin/perl
    use strict;
    use warnings;
    $|++;    # autoflush output

    # get 500 random ip addresses, see genip code below
    my %need;
    open IP, "./genip |" or die "ooeps $!\n";
    for (1..500) {
        my ($a, $b, $c, $d) = split /\./, <IP>;
        $need{$a}{$b}{$c}{$d}++;
    }

    for (1..10_000_000) {
        my $ip = <IP>;
        my ($a, $b, $c, $d) = split /\./, $ip;
        # the compiler may optimise this line ...
        # next unless exists $need{$a}{$b}{$c}{$d};
        # so all the following can probably be replaced,
        # but it is too late for me to benchmark, g'night
        next unless exists $need{$a};
        next unless exists $need{$a}{$b};
        next unless exists $need{$a}{$b}{$c};
        print "a.b.c\n";    # see how sparse we are!
        next unless exists $need{$a}{$b}{$c}{$d};
        print "match ! $ip\n";
    }
    close IP;

    __END__
    # Random IP address generator used above ...
    #!/usr/bin/perl
    use strict;
    use warnings;
    while (1) {
        my $ip = int rand 256;
        for (1..3) {
            $ip .= "." . int rand 256;
        }
        print $ip, $/;
    }
    To get any hits I upped the number of searched-for IPs to 5000, and then saw a few in reasonable time.

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
Re: Efficient Way to Parse a Large Log File with a Large Regex
by thor (Priest) on Apr 12, 2005 at 23:48 UTC
    From your tail -f comment, this is a running log. If so, you can trim down your run time by saving off the last position in the file that you ended at (with tell), reading it back in at script start, and using seek. That way, you're only looking at new entries each time.

    thor

    Feel the white light, the light within
    Be your own disciple, fan the sparks of will
    For all of us waiting, your kingdom will come

Re: Efficient Way to Parse a Large Log File with a Large Regex
by tweetiepooh (Hermit) on Apr 13, 2005 at 12:57 UTC
    What we have done with syslog, and you may be able to do depending on processing speed etc., is to pipe the log-writing process through a Perl script en route to the log file.

    The script can then watch for required patterns as they occur and fire off some process when needed.
Re: Efficient Way to Parse a Large Log File with a Large Regex (with Regexp::Assemble)
by grinder (Bishop) on Apr 13, 2005 at 16:49 UTC
    Is creating a regex, like the one discussed above, going to be the most efficient way?

    It would be if you used Regexp::Assemble :) The code would look something like

    use strict;
    use Regexp::Assemble;

    my $re = do {
        open IN, shift || 'file_of_IPs_sought' or die $!;
        my $guts = Regexp::Assemble->new
            ->add( map { chomp; quotemeta($_) } <IN> )
            ->as_string;
        close IN;
        qr/\b$guts\b/;
    };

    open LOGFILE, shift || 'logfile' or die $!;
    /$re/ and print while <LOGFILE>;
    close LOGFILE;

    # update: if this is a pipe...
    /$re/ and print while <>;

    The expression will probably turn out to be about the same size as the list of IPs. The more they cluster, the smaller the pattern will be. And 500 patterns will barely have Regexp::Assemble breaking a sweat.

    - another intruder with the mooring in the heart of the Perl

      But it will be hugely slower than doing a simple search to find the embedded IP and then looking it up in a hash that contains the 500 IPs in question.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco.
      Rule 1 has a caveat! -- Who broke the cabal?

        It all comes down to the difference between

        /$re/ and print while <>
        and
        while( <> ) {
            while( /\b(\d+\.\d+\.\d+\.\d+)\b/g ) {
                if( exists $ip{$1} ) {
                    print;
                    last;
                }
            }
        }

        Hugely slower? No. A quick benchmark here shows that the regular expression approach is about twice as slow (and we are talking about a problem dominated by disk I/O anyway). One factor depends on how many naked IPs appear on a line: if there are several and only one interests you, the direct regexp will pick it up immediately, whereas the hash approach has to test each one.

        Another consideration is that if you want to extend the approach to search for e.g. 192.168.0.* then you can no longer use the hash approach at all, since what gets matched does not correspond to any key.
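        For instance, something like this hypothetical mix (add() takes pattern strings, so a wildcarded octet survives assembly):

        use strict;
        use warnings;
        use Regexp::Assemble;

        my $ra = Regexp::Assemble->new;
        $ra->add( quotemeta('10.0.0.1') );   # an exact IP
        $ra->add( '192\.168\.0\.\d+' );      # i.e. 192.168.0.*
        my $re = $ra->re;

        print "hit: $1\n" if '192.168.0.42' =~ /($re)/;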

        Or else I completely misread the question, in which case consider my solution withdrawn.

        - another intruder with the mooring in the heart of the Perl

Re: Efficient Way to Parse a Large Log File with a Large Regex
by tphyahoo (Vicar) on Apr 13, 2005 at 15:54 UTC
    Unix tail, for those who didn't know (like me), displays the last ten lines of a file by default; with -f it keeps following the file as new lines are appended.

    http://www.techonthenet.com/unix/basic/tail.htm
