Re: Not sure what I am doing wrong
by CountZero (Bishop) on Nov 24, 2008 at 16:50 UTC
Contrary to what my fellow Monks are telling you, it is not always a good idea to join your single regex patterns with | and then use the combined regex for all your tests. Alternations in a regex are processed in O(n) time, so increasing the number of alternations has a linear effect on your processing time.
To start with, if you do
my $RE = join '|', <RECORDS>;
and then later
/$RE/;
for every test, Perl will have to re-compile your pattern every time. It is much better to pre-compile the pattern:
my $RE = join '|', <RECORDS>;
my $compiled_RE = qr/$RE/;
and then later
/$compiled_RE/;
In that case you only pay the compilation cost once.
Even better is to use a module such as Regexp::Assemble, which can automatically build a more efficient regex for you. Cutting down on the number of alternations leads to a faster regex. I tried this in a program of mine and, by using Regexp::Assemble, managed to get a 30% speed increase on real-world data.
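A minimal sketch of the precompilation advice above; the keyword list is hypothetical, standing in for the lines read from <RECORDS>:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the contents of the records file.
my @keywords = ('this', 'that', 'here');

# Build the alternation and compile it ONCE, outside the matching loop.
my $pattern  = join '|', map quotemeta, @keywords;
my $compiled = qr/$pattern/;

my @hits;
for my $line ('nothing to see', 'look over here', 'that one') {
    # No per-line recompilation: $compiled is already a Regexp object.
    push @hits, $line if $line =~ $compiled;
}
print scalar(@hits), " lines matched\n";   # prints "2 lines matched"
```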
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
As the original post stated that he wanted to look for certain strings, not patterns, I think that
chomp(my @RE = <RECORDS>);
my $RE = join '|', map quotemeta, @RE;
$RE = qr/($RE)/;
will possibly be the most efficient matcher (at least on 5.10 ...), no?
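A sketch of that suggestion in action, with a hypothetical keyword list in place of <RECORDS>; the quotemeta matters when keywords contain regex metacharacters, and the capturing group reports which keyword hit:

```perl
use strict;
use warnings;

# Hypothetical keywords; note the '.' that must match literally.
my @RE = ('a.b', 'foo');
my $RE = join '|', map quotemeta, @RE;
$RE = qr/($RE)/;

if ('find a.b here' =~ $RE) {
    print "matched literal: $1\n";   # prints "matched literal: a.b"
}
# Without quotemeta, 'a.b' would also match 'aXb'; quoted, it does not.
print 'find aXb here' =~ $RE ? "dot matched\n" : "dot did not match\n";
```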
[]s, HTH, Massa (κς,πμ,πλ)
                  Rate       Raw Regex Assembled Regex
Raw Regex        614/s              --             -9%
Assembled Regex  676/s             10%              --
The text to check was part of a play by Shakespeare, loaded into an array (one line per element), and the regex used was qr/this|that|here|there|where/;
Assembled into a more efficient regex it looks like:
qr/(?:th(?:ere|at|is)|w?here)/
Even by eliminating only ONE of the alternations (5 vs. 4), a 10% speed-up was obtained.
PS: I do not know about 5.10; this was tested on 5.8.
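A rough sketch of how a comparison like the one above could be reproduced with the core Benchmark module; the sample lines here stand in for the Shakespeare text, so the exact percentages will differ:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample text standing in for the play.
my @lines = ('to be or not to be', 'get thee hence', 'what light is there') x 500;

my $raw       = qr/this|that|here|there|where/;
my $assembled = qr/(?:th(?:ere|at|is)|w?here)/;   # fewer alternations

# Run each matcher for about one CPU second and compare rates.
cmpthese(-1, {
    'Raw Regex'       => sub { my $n = grep { /$raw/ } @lines },
    'Assembled Regex' => sub { my $n = grep { /$assembled/ } @lines },
});
```

Both patterns accept exactly the same strings; only the matching speed differs.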
Re: Not sure what I am doing wrong
by DStaal (Chaplain) on Nov 24, 2008 at 14:57 UTC
What, exactly, do you think is going wrong? Yes, this looks like it is reading through the files and checking each line of the files against each of your keys, which seems to be both what you are trying to do and what you are worried about.
I can see easier, and maybe faster, ways to code some of the above (check out IO::Uncompress::Gunzip), but I'm not sure what you think the problem is.
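A sketch of the IO::Uncompress::Gunzip suggestion above: reading a gzipped log line by line without shelling out to gunzip. A tiny in-memory "log" stands in for a real compressed file here; with a real file you would pass the filename instead of the scalar ref:

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Build a tiny gzipped "log" in memory so the sketch is self-contained.
my $data = "one match here\nno hit\nanother match\n";
gzip \$data => \my $gzipped or die "gzip failed: $GzipError";

# For a real log you would write: IO::Uncompress::Gunzip->new($filename)
my $z = IO::Uncompress::Gunzip->new(\$gzipped)
    or die "gunzip failed: $GunzipError";

my $hits = 0;
while (my $line = <$z>) {       # reads decompressed data line by line
    $hits++ if $line =~ /match/;
}
close $z;
print "$hits matching lines\n";   # prints "2 matching lines"
```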
Thanks for the reply. I guess it's not so much a problem as that I think there may be a better way to search through the logs for each keyword (or I'm hoping there is). The problem is that each log file is about 10-18 MB compressed, so searching for each of the 100 keywords against every single line in a log file takes a really long time: about 40-50 minutes per log file. Is there a way to search faster?
40-50 minutes per log file? Yeah, you should be able to speed that up... (Dependent on hardware, of course, but my Mail::Log::Parse modules go through 8-9 MB (after compression) logfiles in around a minute per (compressed) megabyte, and they do a lot more work than this.)
The other comments on combining patterns could be useful, but the other thing to look at is what exactly your patterns are. Small differences in patterns can make some fairly major differences in speed. (Of course, if you are doing literals it's probably not that bad.)
Re: Not sure what I am doing wrong
by massa (Hermit) on Nov 24, 2008 at 14:59 UTC
sub queries() {
    if ( -f and /^ftr001/ ) {
        print "----------------------------------------------\n";
        print "Searching File:", $_, "\n";
        print "----------------------------------------------\n";
        open(UNZIP, "gunzip -dc < $_ |") or die "gunzip: $!";
        open(RECORDS, "/tmp/logs/records.txt")
            or die "Could not open records file: $!";
        my @REC = <RECORDS>;
        chomp @REC;
        my $RE = join '|', map { quotemeta $_ } @REC;
        $RE = qr/($RE)/;
        while ( <UNZIP> ) {
            print "Searching for: $1\nFound it in: $_\n" if /$RE/;
        }
    }
}
[]s, HTH, Massa (κς,πμ,πλ)
Re: Not sure what I am doing wrong
by Bloodnok (Vicar) on Nov 24, 2008 at 15:02 UTC
The first thing you should do is to get the close() calls in the right context...
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my $DIR = "/tmp/logs/";
find(\&queries, $DIR);

sub queries() {
    if ( -f and /^ftr001/ ) {
        print "----------------------------------------------\n";
        print "Searching File:", $_, "\n";
        print "----------------------------------------------\n";
        open(UNZIP, "gunzip -dc < $_ |") or die "gunzip: $!";
        open(RECORDS, "/tmp/logs/records.txt")
            or die "Could not open records file: $!";
        my @REC = <RECORDS>;
        chomp @REC;
        close RECORDS;
        while ( <UNZIP> ) {
            my $EACH_LINE = $_;
            foreach my $REC_QUERY (@REC) {
                print "Searching for current record: $REC_QUERY\n";
                if ( $EACH_LINE =~ /\($REC_QUERY\)/ ) {
                    print "Querying for record: $REC_QUERY\n";
                    print "$EACH_LINE\n";
                }
            }
        }
        close UNZIP;
    }
}
...in that way you'll be closing the files at the right time, i.e. after you've finished with them. Otherwise you'll encounter problems...
After that, if I've understood you correctly, the simplest approach would be to convert the sought-after records into an RE and use that against each file. Moreover, there's a marked performance hit in opening the records file once for every directory, compared with opening it once at the outset - something along the lines of the following (untested) snippet...
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my $DIR = "/tmp/logs/";
find(\&queries, $DIR);

open(RECORDS, "/tmp/logs/records.txt")
    or die "Could not open records file: $!";
local $/ = undef; # Ignoring binmode as it's a text file
my $RE = join '|', <RECORDS>;
close RECORDS;

sub queries() {
    return unless -f and /^ftr001/;
    print "----------------------------------------------\n";
    print "Searching File:", $_, "\n";
    print "----------------------------------------------\n";
    open(UNZIP, "gunzip -dc < $_ |") or die "gunzip: $!";
    while ( <UNZIP> ) {
        next unless /$RE/;
        print $_;
    }
    close UNZIP;
}
A user level that continues to overstate my experience :-))
Thanks for the quick reply. I used the code above, which looks to be correct, but I get the following error messages. Not 100% sure why these errors are coming up.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11802.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11803.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11804.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11805.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11806.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11807.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11808.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11809.
Did you mean to put the following outside the sub?
open(RECORDS, "/tmp/logs/records.txt")
    or die "Could not open records file: $!";
local $/ = undef; # Ignoring binmode as it's a text file
my $RE = join '|', <RECORDS>;
close RECORDS;
I tried moving the stuff above inside the sub, printed the $RE var, and got the correct records, but got no results. Not sure what the problem is. All in all, what I am trying to figure out is whether there is a faster way to search for each of the 100 key entries against all the log files. The current problem (if that's what you want to call it) is that it takes almost 1 hour to search for the 100 keywords against each of the log files, which are about 15-18 MB in size.
Thanks again for all the HELP!!
use File::Find;

my $DIR = "/tmp/logs/";
open(RECORDS, "/tmp/logs/records.txt")
    or die "Could not open records file: $!";
local $/ = undef; # Ignoring binmode as it's a text file
my $RE = join '|', <RECORDS>;
close RECORDS;
find(\&queries, $DIR);
Oops, sorry, my fault - as shown, the $RE lexical variable wasn't defined until after use (by the queries sub).
Re: Not sure what I am doing wrong
by f00li5h (Chaplain) on Nov 24, 2008 at 15:02 UTC
$regex = join '|', map quotemeta, @REC;
print "I want $EACH_LINE" if $EACH_LINE =~ $regex;
It will get you a pattern made up of all the lines in the file, joined by |, which will match any of them.
It may need anchoring if the lines are meant to match the full line instead.
Also, the quotemeta presumes you just want literal strings (remove it if @REC contains regexen)
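A quick sketch of the anchoring point above, with a hypothetical keyword list: unanchored, the pattern matches anywhere in the line; anchored, the entire line must be one of the keywords:

```perl
use strict;
use warnings;

# Hypothetical keywords standing in for @REC.
my @REC   = ('error', 'warn');
my $regex = join '|', map quotemeta, @REC;

my $sub   = qr/$regex/;          # matches anywhere in the line
my $whole = qr/^(?:$regex)$/;    # the whole line must be a keyword

print 'warning' =~ $sub   ? "substring: hit\n" : "substring: miss\n"; # hit ('warn' is inside)
print 'warning' =~ $whole ? "anchored: hit\n"  : "anchored: miss\n";  # miss (not a full keyword)
```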
@_=qw; ask f00li5h to appear and remain for a moment of pretend better than a lifetime;;s;;@_[map hex,split'',B204316D8C2A4516DE];;y/05/os/&print;