Re: Efficient Way to Parse a Large Log File with a Large Regex
by hsinclai (Deacon) on Apr 12, 2005 at 17:19 UTC
The log file is large, it sometimes gets up to 3GB
A large-file seeking technique is described here and it works very well. The discussion was about replacing characters with tr///, but I think you can adapt it to your IP-matching needs easily. HTH
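For what it's worth, here is a hedged sketch of that chunked-read idea (the sub name, chunk size, and overlap length are my own invention, not from the linked node): read the file in big blocks with sysread instead of line by line, keeping a short tail from each block so an IP split across a block boundary is not missed.

```perl
use strict;
use warnings;

# scan_file($path, $re): return 1 if $re matches anywhere in the file,
# reading in 1 MB chunks rather than line by line.
sub scan_file {
    my ($path, $re) = @_;
    open my $fh, '<', $path or die "open $path: $!";
    my $carry = '';
    while (sysread $fh, my $chunk, 1 << 20) {
        $chunk = $carry . $chunk;
        if ($chunk =~ $re) { close $fh; return 1 }
        # an IPv4 address is at most 15 characters, so a 15-byte
        # overlap catches anything split across a chunk boundary
        $carry = length($chunk) > 15 ? substr($chunk, -15) : $chunk;
    }
    close $fh;
    return 0;
}
```

Since it only returns a boolean, the overlap can't cause double counting; if you wanted every match, you'd have to be a little more careful at the seams.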
Re: Efficient Way to Parse a Large Log File with a Large Regex
by gam3 (Curate) on Apr 12, 2005 at 17:27 UTC
You can give this a try and see just how slow it is.
@list = map({ quotemeta "129.$_.125.123" } (0..255));
$regex_text = join('|', @list);
$re = qr[($regex_text)];
print $re, "\n";
while (<>) {
    if ($_ =~ $re) {
        print "$1\n";
    }
}
Update: This seems to be faster than the hash method.
-- gam3
A picture is worth a thousand words, but takes 200K.
Thanks. This is what I came up with based on your code:
use warnings;
use strict;
my @ips = qw/192.168.2.1 ..../;
@ips = map { quotemeta } @ips;
my $regex = join('|', @ips);
my $re = qr[$regex];
while (<>) {
    print if /$re/;
}
I'm then calling it like so:
tail -f fw.log | /usr/local/scripts/parseips
It's taking up quite a bit of resources, but not bringing the server to its knees.
Thank you for the other suggestions also. I'm going to come up with a more permanent solution based on one of these that does not require me to stare at a terminal.
Re: Efficient Way to Parse a Large Log File with a Large Regex
by NateTut (Deacon) on Apr 12, 2005 at 18:03 UTC
Process the file once, saving the offset into the file that you finished at. Then seek to that position next time and process from there to the end of the file and store the new end of file position.
This should save you a lot of redundant processing.
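A minimal sketch of that idea (the sub name and the state-file convention are made up for illustration): remember the end-of-file offset in a small state file between runs, and seek straight to it next time.

```perl
use strict;
use warnings;

# process_new_lines($log, $state, $cb): run $cb on every line added to
# $log since the last call, remembering the end-of-file offset in a
# small state file between runs.
sub process_new_lines {
    my ($log, $state, $cb) = @_;

    # read the offset saved by the previous run (0 on the first run)
    my $pos = 0;
    if (open my $in, '<', $state) {
        chomp($pos = <$in> // 0);
        close $in;
    }

    open my $fh, '<', $log or die "open $log: $!";
    seek $fh, $pos, 0;            # skip everything already processed
    $cb->($_) while <$fh>;

    # store the new end-of-file position for next time
    open my $out, '>', $state or die "write $state: $!";
    print $out tell($fh), "\n";
    close $out;
    close $fh;
}
```

One caveat: if the log gets rotated, the saved offset may point past the new end of file, so a real version should fall back to offset 0 when the file shrinks.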
Re: Efficient Way to Parse a Large Log File with a Large Regex
by samtregar (Abbot) on Apr 12, 2005 at 17:09 UTC
I don't know how efficient they are but the log-file parsing techniques in chapter 6 of Higher Order Perl are definitely worth a look. At the very least they're guaranteed to blow your mind.
-sam
Re: Efficient Way to Parse a Large Log File with a Large Regex
by Fletch (Bishop) on Apr 12, 2005 at 17:36 UTC
If you just want to check for the presence of one of a group of IPs it'd be much more efficient to build a hash of the IPs up front, and then parse out the IP from each record and do an exists $wanted{ $curIP } to tell if it's interesting or not.
Re: Efficient Way to Parse a Large Log File with a Large Regex
by samizdat (Vicar) on Apr 12, 2005 at 17:48 UTC
Oh, goody, I get to be the first one to suggest something. :) See 'a fast multipattern grep' in the Panther Book (Advanced Perl Programming), p. 74.
Your 'tail' idea is a good one, but take care not to get overrun if it gets busy. If you can syslog a marker line into the big file, that is helpful as an index.
Re: Efficient Way to Parse a Large Log File with a Large Regex
by holli (Abbot) on Apr 12, 2005 at 20:07 UTC
Doesn't this cry for a hash-lookup?
use strict;
use warnings;

my @ips =
(
    "192.1.20.1",
    "192.1.20.2",
);
my %ips = map { $_ => 1 } @ips;

open LOG, "<", "logfile" or die $!;
while ( <LOG> )
{
    # match ip-address
    if ( /(([0-9]+\.)+[0-9]+)/ )
    {
        if ( $ips{$1} )
        {
            # do found-ip stuff here
        }
        else
        {
            # do other stuff here
        }
    }
}
close LOG;
Re: Efficient Way to Parse a Large Log File with a Large Regex
by CountZero (Bishop) on Apr 12, 2005 at 20:49 UTC
Save your list of IPs to a database and check each log entry against this DB as soon as the log entry gets written. If you can 'capture' the writing to the log file and pipe it to a Perl program that extracts the IPs and checks them against the database, that seems feasible.
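A hedged sketch of that, assuming DBI with DBD::SQLite is available; the table name watched_ips is invented, and an in-memory database stands in for whatever DB you would really use:

```perl
use strict;
use warnings;
use DBI;    # assumes DBI and DBD::SQLite are installed

# An in-memory DB keeps the sketch self-contained; a real setup would
# connect to a file or a server-side database instead.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE watched_ips (ip TEXT PRIMARY KEY)');
$dbh->do(q{INSERT INTO watched_ips VALUES ('192.168.2.1')});

my $check = $dbh->prepare('SELECT 1 FROM watched_ips WHERE ip = ?');

# return the first IP on the line if it is in the table, else nothing
sub watched_ip {
    my ($line) = @_;
    return unless $line =~ /(\d+\.\d+\.\d+\.\d+)/;
    $check->execute($1);
    my ($found) = $check->fetchrow_array;
    $check->finish;
    return $found ? $1 : ();
}

# driver: pipe the log-writing process (or tail -f) into this script:
#   while (<STDIN>) { print if watched_ip($_) }
```

The nice part is that the watch list can then be updated from anywhere, without restarting the filter.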
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
This seems pretty reasonable. Additionally, you could create a simple POE process to tail the log file rather than piping through tail. There are several examples at the POE website. Also, merlyn has an article on tailing a logfile and processing the entries on his website.
It's fun to read all the replies. A lot of good ideas. I don't have anything new to add, other than this pointer to a Perl snippet by Lincoln Stein for using a DBMS for httpd logging. This approach reduces the problem of parsing log files to the much cleaner one of constructing SQL queries. And, as CountZero already pointed out, you can build in some hooks for preprocessing of log records, including one that does the checking against your table of IP addresses. Then all you have to do is check the entries recorded with a timestamp more recent than the last check. (Incidentally, I vote for holli's hash lookup approach.)
Re: Efficient Way to Parse a Large Log File with a Large Regex
by Random_Walk (Prior) on Apr 12, 2005 at 22:57 UTC
As mentioned above, use seek so you don't re-read the log. Either qr// a series of regexes into an array and loop over them, or, if you can swiftly split the IP out of the log (i.e. if it always appears at the same position on a line), you can use unpack to extract the IP, and something like this may help. Even more so if your desired IPs cluster a bit in the first (class A) octet.
#!/usr/bin/perl
use strict;
use warnings;
$|++;    # autoflush

# get 500 random ip addresses, see genip code below
my %need;
open IP, "./genip |" or die "ooeps $!\n";
for (1..500) {
    my ($a, $b, $c, $d) = split /\./, <IP>;
    $need{$a}{$b}{$c}{$d}++;
}
for (1..10_000_000) {
    my $ip = <IP>;
    my ($a, $b, $c, $d) = split /\./, $ip;
    # the compiler may optimise this line ...
    # next unless exists $need{$a}{$b}{$c}{$d};
    # so all the following can probably be replaced,
    # but it is too late for me to benchmark, g'night
    next unless exists $need{$a};
    next unless exists $need{$a}{$b};
    next unless exists $need{$a}{$b}{$c};
    print "$a.$b.$c\n";    # see how sparse we are!
    next unless exists $need{$a}{$b}{$c}{$d};
    print "match! $ip\n";
}
close IP;
close IP;
__END__
# Random IP address generator used above ...
#!/usr/bin/perl
use strict;
use warnings;
while (1) {
    my $ip = int rand 256;
    for (1..3) {
        $ip .= "." . int rand 256;
    }
    print $ip, $/;
}
To get any hits I upped the searched-for IPs to 5000, then saw a few in reasonable time.
Cheers, R.
Pereant, qui ante nos nostra dixerunt!
Re: Efficient Way to Parse a Large Log File with a Large Regex
by thor (Priest) on Apr 12, 2005 at 23:48 UTC
From your tail -f comment, this is a running log. If so, you can trim down your runtime by saving off the last position you ended at with tell, reading it back in at script start, and using seek. That way, you're only looking at new entries each time.
thor
Feel the white light, the light within
Be your own disciple, fan the sparks of will
For all of us waiting, your kingdom will come
Re: Efficient Way to Parse a Large Log File with a Large Regex
by tweetiepooh (Hermit) on Apr 13, 2005 at 12:57 UTC
What we have done with syslog, and what you may be able to do depending on processing speed etc., is to pipe the log-writing process through a Perl script en route to the log file.
The script can then watch for required patterns as they occur and fire off some process when needed.
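A hedged sketch of such an en-route filter (the watched pattern and the alert hook are invented examples): pass every line through unchanged so the log file still gets written, and fire the hook when a watched IP goes by.

```perl
use strict;
use warnings;
$| = 1;    # don't buffer the passthrough

my $watch = qr/\b192\.168\.2\.1\b/;    # example watched IP

# returns true if the line should trigger the alert hook
sub wants_alert { return $_[0] =~ $watch }

# hypothetical hook: warn here, but it could mail someone,
# run a script, write to another file, etc.
sub on_match {
    my ($line) = @_;
    warn "ALERT: $line";
}

# driver, used in a pipeline like: logger | this_script >> fw.log
#   while (my $line = <STDIN>) {
#       print $line;                        # pass through unchanged
#       on_match($line) if wants_alert($line);
#   }
```

Turning off buffering matters here; otherwise the log file can lag the logger by a whole block.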
Re: Efficient Way to Parse a Large Log File with a Large Regex (with Regexp::Assemble)
by grinder (Bishop) on Apr 13, 2005 at 16:49 UTC
use strict;
use Regexp::Assemble;

my $re = do {
    open IN, shift || 'file_of_IPs_sought' or die $!;
    my $guts = Regexp::Assemble->new->add(
        map {
            chomp;
            quotemeta($_)
        } <IN>
    )->as_string;
    close IN;
    qr/\b$guts\b/
};

open LOGFILE, shift || 'logfile' or die $!;
/$re/ and print while <LOGFILE>;
close LOGFILE;

# update: if this is a pipe...
/$re/ and print while <>;
The expression will probably turn out to be about the same size as the list of IPs. The more they cluster, the smaller the pattern will be. And 500 patterns will barely have Regexp::Assemble breaking a sweat.
- another intruder with the mooring in the heart of the Perl
/$re/ and print while <>
and
while( <> ) {
    while( /\b(\d+\.\d+\.\d+\.\d+)\b/g ) {
        if( exists $ip{$1} ) {
            print;
            last;
        }
    }
}
Hugely slower? No. A quick benchmark here shows that the regular expression approach is about twice as slow (and we are talking about a problem dominated by disk I/O anyway). One factor depends on how many naked IPs appear on a line. If there are several and only one interests you, the direct regexp will pick it up immediately, whereas the hash approach will have to test each one.
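A quick way to reproduce that kind of comparison yourself is the core Benchmark module; the IP list and sample line below are invented for illustration:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# 256 invented watch IPs, held both as a hash and as one alternation
my @ips = map { "10.0.$_.1" } 0 .. 255;
my %ip  = map { $_ => 1 } @ips;
my $alt = join '|', map quotemeta, @ips;
my $re  = qr/\b($alt)\b/;

my $line = "Apr 12 17:09:01 fw DROP src=10.0.200.1 dst=10.9.9.9\n";

cmpthese(-1, {
    big_regex => sub { () = $line =~ $re },
    hash_look => sub {
        pos($line) = 0;    # reset the /g match position between runs
        while ($line =~ /\b(\d+\.\d+\.\d+\.\d+)\b/g) {
            last if $ip{$1};
        }
    },
});
```

The relative numbers will shift with how many IPs are on the watch list and how many appear per line, which is exactly the point being made above.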
Another consideration is that if you want to extend the approach to search for e.g. 192.168.0.* then you can no longer use the hash approach at all, since what gets matched does not correspond to any key.
Or else I completely misread the question, in which case consider my solution withdrawn.
- another intruder with the mooring in the heart of the Perl
Re: Efficient Way to Parse a Large Log File with a Large Regex
by tphyahoo (Vicar) on Apr 13, 2005 at 15:54 UTC