Efficient Way to Parse a Large Log File with a Large Regex

by Dru (Hermit)
on Apr 12, 2005 at 17:02 UTC ( [id://447093] )

Dru has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have an array of almost 500 IPs, and I want to see whether any of them appear in a log file. The log file is large, it sometimes gets up to 3GB. I want to run this script from cron every hour to check for these IPs, but I'm thinking this might be too much of a load on the server (dual CPU, 2GB memory, RedHat ES 3.0), so I might run it just a few times a day. I also thought about doing a tail -f logfile | <name of program>.pl to look at just the new log entries, but again I'm concerned about the server being able to keep up.

Anyway, I'm looking for suggestions on how to efficiently parse this much data. My initial plan was to build a regex that groups, but does not capture, all of the IPs, with an alternation between each one. Something along the lines of:
/(?:192\.168\.1\.1|192\.168\.2\.1)/
BTW, the IPs are not in a nice sequential order like the ones above; they are all over the place.

Actually, I still haven't figured out how I'm going to get from the array to the regex. I was thinking I could use map to build it, but I'm still a map newbie. I did backslash each dot like this:
@ips = map { quotemeta } @ips;
my $file = shift;
So I guess my questions are:

1. Is creating a regex, like the one discussed above, going to be the most efficient way?

2. If yes to number 1, any suggestions on how to build a regex from the array?

P.S. I know the term efficient can vary greatly from one programmer to the next, but I'm just looking for suggestions.

-Dru

Replies are listed 'Best First'.
Re: Efficient Way to Parse a Large Log File with a Large Regex
by hsinclai (Deacon) on Apr 12, 2005 at 17:19 UTC
    The log file is large, it sometimes gets up to 3GB

    A technique for seeking through large files is described here, and it works very well. That discussion was about replacing characters with tr, but I think you can easily adapt it to your IP-matching needs. HTH
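    Something like this rough sketch, perhaps (my adaptation, not the code from that node; the pattern and path are placeholders): read the file in big sysread chunks and keep a short overlap so an IP split across a block boundary isn't missed.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $re = qr/(?:192\.168\.1\.1|192\.168\.2\.1)/;        # placeholder pattern
    open my $fh, '<', '/var/log/fw.log' or die "open: $!"; # placeholder path

    my ( $block, $tail ) = ( '', '' );
    while ( sysread $fh, $block, 8 * 1024 * 1024 ) {       # 8MB blocks
        my $buf = $tail . $block;
        print "match\n" if $buf =~ $re;
        # carry the last 15 bytes (max length of a dotted quad) forward
        $tail = length($buf) > 15 ? substr( $buf, -15 ) : $buf;
    }
    close $fh;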

Re: Efficient Way to Parse a Large Log File with a Large Regex
by gam3 (Curate) on Apr 12, 2005 at 17:27 UTC
    You can give this a try and see just how slow it is.
    @list = map({ quotemeta "129.$_.125.123" } (0..255));
    $regex_text = join('|', @list);
    $re = qr[($regex_text)];
    print $re, "\n";
    while (<>) {
        if ($_ =~ $re) {
            print "$1\n";
        }
    }
    Update: This seems to be faster than the hash method.
    -- gam3
    A picture is worth a thousand words, but takes 200K.
      Thanks. This is what I came up with based on your code:
      use warnings;
      use strict;

      my @ips = qw/192.168.2.1 ..../;
      @ips = map { quotemeta } @ips;
      my $regex = join('|', @ips);
      my $re = qr[$regex];

      while (<>) {
          print if /$re/;
      }
      I'm then calling it like so:
      tail -f fw.log | /usr/local/scripts/parseips
      It's taking up quite a bit of resources, but not bringing the server to its knees.

      Thank you for the other suggestions also. I'm going to come up with a more permanent solution, based on one of these, that does not require me to stare at a terminal.
Re: Efficient Way to Parse a Large Log File with a Large Regex
by NateTut (Deacon) on Apr 12, 2005 at 18:03 UTC
    Process the file once, saving the offset into the file where you finished. Then, next time, seek to that position and process from there to the end of the file, storing the new end-of-file position.

    This should save you a lot of redundant processing.
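    A minimal sketch of that idea (the log path and state-file location are assumptions):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $log   = '/var/log/fw.log';      # assumed path
    my $state = '/var/tmp/fw.offset';   # assumed state file

    # read back the offset saved by the previous run, if any
    my $offset = 0;
    if ( open my $s, '<', $state ) {
        $offset = <$s> + 0;
        close $s;
    }

    open my $fh, '<', $log or die "open $log: $!";
    $offset = 0 if $offset > -s $log;   # log was rotated or truncated
    seek $fh, $offset, 0 or die "seek: $!";

    while (<$fh>) {
        # match the IP list against each new line here
    }

    # remember where we stopped, for next time
    open my $out, '>', $state or die "write $state: $!";
    print $out tell($fh);
    close $out;
    close $fh;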
Re: Efficient Way to Parse a Large Log File with a Large Regex
by samtregar (Abbot) on Apr 12, 2005 at 17:09 UTC
    I don't know how efficient they are but the log-file parsing techniques in chapter 6 of Higher Order Perl are definitely worth a look. At the very least they're guaranteed to blow your mind.

    -sam

Re: Efficient Way to Parse a Large Log File with a Large Regex
by Fletch (Bishop) on Apr 12, 2005 at 17:36 UTC

    If you just want to check for the presence of one of a group of IPs, it'd be much more efficient to build a hash of the IPs up front, then parse the IP out of each record and do an exists $wanted{ $curIP } to tell whether it's interesting or not.
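    A minimal sketch of that, assuming plain dotted quads appear somewhere on each line:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @ips    = qw( 192.168.1.1 192.168.2.1 );   # your ~500 IPs
    my %wanted = map { $_ => 1 } @ips;

    while ( my $line = <> ) {
        # pull the first dotted quad out of the record (format assumed)
        my ($curIP) = $line =~ /(\d{1,3}(?:\.\d{1,3}){3})/ or next;
        print $line if exists $wanted{$curIP};
    }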

Re: Efficient Way to Parse a Large Log File with a Large Regex
by samizdat (Vicar) on Apr 12, 2005 at 17:48 UTC
    Oh, goody, I get to be the first one to suggest something. :) See 'a fast multipattern grep' in the Panther Book (Advanced Perl Programming), p. 74.

    Your 'tail' idea is a good one, but take care not to get overrun if the log gets busy. If you can syslog a marker line into the big file, that helps as an index.
Re: Efficient Way to Parse a Large Log File with a Large Regex
by holli (Abbot) on Apr 12, 2005 at 20:07 UTC
    Doesn't this cry for a hash-lookup?
    use strict;

    my @ips = (
        "192.1.20.1",
        "192.1.20.2",
    );
    my %ips = map { $_ => 1 } @ips;

    open LOG, "<", "logfile" or die $!;
    while ( <LOG> ) {
        # match ip-address
        if ( /(([0-9]+\.)+[0-9]+)/ ) {
            if ( $ips{$1} ) {
                # do found ip stuff here
            }
            else {
                # do other stuff here
            }
        }
    }
    close LOG;


    holli, /regexed monk/

      That, and possibly a dispatch table, especially if you want specific logging or functions to happen for certain IP addresses.
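      For example (a hypothetical sketch; the IPs and handlers are made up):

      #!/usr/bin/perl
      use strict;
      use warnings;

      # map each interesting IP to the action it should trigger
      my %handler = (
          '192.1.20.1' => sub { warn "watched host seen: $_[0]" },
          '192.1.20.2' => sub { print "other watched host: $_[0]" },
      );

      while ( my $line = <> ) {
          next unless $line =~ /((?:[0-9]+\.)+[0-9]+)/;
          my $code = $handler{$1} or next;
          $code->($line);
      }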

Re: Efficient Way to Parse a Large Log File with a Large Regex
by CountZero (Bishop) on Apr 12, 2005 at 20:49 UTC
    Save your list of IPs to a database and check each log entry against this DB as soon as the entry gets written.

    If you can 'capture' the writes to the log file and pipe them to a Perl program that extracts the IPs and checks them against the database, that seems feasible.
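    A sketch of what that could look like (the database name and the watched(ip) table are assumptions; SQLite via DBI is used here just for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=ips.db', '', '',
                            { RaiseError => 1 } );
    my $sth = $dbh->prepare('SELECT 1 FROM watched WHERE ip = ?');

    while ( my $line = <> ) {   # log entries piped in
        next unless $line =~ /(\d{1,3}(?:\.\d{1,3}){3})/;
        $sth->execute($1);
        print $line if $sth->fetchrow_array;
        $sth->finish;
    }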

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      This seems pretty reasonable. Additionally, you could create a simple POE process to tail the log file rather than piping through tail. There are several examples at the POE website. Also, merlyn has an article on tailing a logfile and processing the entries on his website.
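      A minimal POE sketch of that (the filename is assumed):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use POE qw(Wheel::FollowTail);

      POE::Session->create(
          inline_states => {
              _start => sub {
                  # follow the log without an external tail process
                  $_[HEAP]{tail} = POE::Wheel::FollowTail->new(
                      Filename   => '/var/log/fw.log',
                      InputEvent => 'got_line',
                  );
              },
              got_line => sub {
                  my $line = $_[ARG0];
                  # check $line against the IP list here
              },
          },
      );
      POE::Kernel->run();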

      It's fun to read all the replies. A lot of good ideas. I don't have anything new to add, other than this pointer to a Perl snippet by Lincoln Stein for using a DBMS for httpd logging. This approach reduces the problem of parsing log files to the much cleaner one of constructing SQL queries. And, as CountZero already pointed out, you can build in some hooks for preprocessing of log records, including one that checks against your table of IP addresses. Then all you have to do is check the entries recorded with a timestamp more recent than the last check. (Incidentally, I vote for holli's hash-lookup approach.)

      the lowliest monk

Re: Efficient Way to Parse a Large Log File with a Large Regex
by Random_Walk (Prior) on Apr 12, 2005 at 22:57 UTC

    As mentioned above, use seek so you don't re-read the log. Either qr// a series of regexes into an array and loop over it, or, if you can swiftly split the IP out of the log (i.e. if it always appears in the same position on a line, you can use unpack to extract it), something like this may help. Even more so if your desired IPs cluster a bit in the class A octet.

    #!/usr/bin/perl
    use strict;
    use warnings;
    $|++;    # autoflush output

    # get 500 random ip addresses, see genip code below
    my %need;
    open IP, "./genip |" or die "ooeps $!\n";
    for (1..500) {
        my ($a, $b, $c, $d) = split /\./, <IP>;
        $need{$a}{$b}{$c}{$d}++;
    }

    for (1..10_000_000) {
        my $ip = <IP>;
        my ($a, $b, $c, $d) = split /\./, $ip;
        # the compiler may optimise this line ...
        # next unless exists $need{$a}{$b}{$c}{$d};
        # so all the following can probably be replaced,
        # but it is too late for me to benchmark, g'night
        next unless exists $need{$a};
        next unless exists $need{$a}{$b};
        next unless exists $need{$a}{$b}{$c};
        print "a.b.c\n";    # see how sparse we are!
        next unless exists $need{$a}{$b}{$c}{$d};
        print "match ! $ip\n";
    }
    close IP;

    __END__
    # Random IP address generator used above ...
    #!/usr/bin/perl
    use strict;
    use warnings;
    while (1) {
        my $ip = int rand 256;
        for (1..3) {
            $ip .= "." . int rand 256;
        }
        print $ip, $/;
    }
    To get any hits I upped the number of searched-for IPs to 5000, and then saw a few in reasonable time.

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
Re: Efficient Way to Parse a Large Log File with a Large Regex
by thor (Priest) on Apr 12, 2005 at 23:48 UTC
    From your tail -f comment, this is a running log. If so, you can trim down your run time by saving off the last position in the file that you ended at (with tell), reading it back in at script start, and using seek. That way, you're only looking at new entries each time.

    thor

    Feel the white light, the light within
    Be your own disciple, fan the sparks of will
    For all of us waiting, your kingdom will come

Re: Efficient Way to Parse a Large Log File with a Large Regex
by tweetiepooh (Hermit) on Apr 13, 2005 at 12:57 UTC
    What we have done with syslog, and you may be able to do depending on processing speed etc., is to pipe the log-writing process through a Perl script en route to the log file.

    The script can then watch for required patterns as they occur and fire off some process when needed.
Re: Efficient Way to Parse a Large Log File with a Large Regex (with Regexp::Assemble)
by grinder (Bishop) on Apr 13, 2005 at 16:49 UTC
    Is creating a regex, like the one discussed above, going to be the most efficient way?

    It would be if you used Regexp::Assemble :) The code would look something like

    use strict;
    use Regexp::Assemble;

    my $re = do {
        open IN, shift || 'file_of_IPs_sought' or die $!;
        my $guts = Regexp::Assemble->new
            ->add( map { chomp; quotemeta($_) } <IN> )
            ->as_string;
        close IN;
        qr/\b$guts\b/;
    };

    open LOGFILE, shift || 'logfile' or die $!;
    /$re/ and print while <LOGFILE>;
    close LOGFILE;

    # update: if this is a pipe...
    /$re/ and print while <>;

    The expression will probably turn out to be about the same size as the list of IPs. The more they cluster, the smaller the pattern will be. And 500 patterns will barely have Regexp::Assemble breaking a sweat.

    - another intruder with the mooring in the heart of the Perl

      But it will be hugely slower than doing a simple search to find the embedded IP and then looking it up in a hash that contains the 500 IPs in question.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco.
      Rule 1 has a caveat! -- Who broke the cabal?

        It all comes down to the difference between

        /$re/ and print while <>
        and
        while( <> ) {
            while( /\b(\d+\.\d+\.\d+\.\d+)\b/g ) {
                if( exists $ip{$1} ) {
                    print;
                    last;
                }
            }
        }

        Hugely slower? No. A quick benchmark here shows that the regular expression approach is about twice as slow (and we are talking about a problem dominated by disk I/O anyway). One factor depends on how many naked IPs appear on a line: if there are several and only one interests you, the direct regexp will pick it up immediately, whereas the hash approach has to test each one.

        Another consideration is that if you want to extend the approach to search for e.g. 192.168.0.* then you can no longer use the hash approach at all, since what gets matched does not correspond to any key.
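        For instance, something like this hypothetical mix (add() takes pattern strings, so a wildcarded octet survives assembly):

        use strict;
        use warnings;
        use Regexp::Assemble;

        my $ra = Regexp::Assemble->new;
        $ra->add( quotemeta('10.0.0.1') );   # an exact IP
        $ra->add( '192\.168\.0\.\d+' );      # i.e. 192.168.0.*
        my $re = $ra->re;

        print "hit: $1\n" if '192.168.0.42' =~ /($re)/;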

        Or else I completely misread the question, in which case consider my solution withdrawn.

        - another intruder with the mooring in the heart of the Perl

Re: Efficient Way to Parse a Large Log File with a Large Regex
by tphyahoo (Vicar) on Apr 13, 2005 at 15:54 UTC
    Unix tail, for those who didn't know (like me), displays the last ten lines of a file by default; with -f it keeps following the file as new lines are appended.

    http://www.techonthenet.com/unix/basic/tail.htm
