Re: Not sure what I am doing wrong
by CountZero (Bishop) on Nov 24, 2008 at 16:50 UTC
Contrary to what my fellow Monks are telling you, it is not always a good idea to join your single regex patterns with | and then use the combined regex for all your tests. Alternations in a regex are processed in O(n) time, so increasing the number of alternations has a linear effect on your processing time.
To start with, if you do
my $RE = join '|', <RECORDS>;
and then later
/$RE/;
for every test, Perl will have to re-compile your pattern every time. It is much better to pre-compile the pattern:
my $RE = join '|', <RECORDS>;
my $compiled_RE = qr/$RE/;
and then later
/$compiled_RE/;
In that case you only pay the compilation cost once.
Even better is to use a module such as Regexp::Assemble, which can automatically build a more efficient regex for you. Cutting down on the number of alternations leads to a faster regex. I tried this in a program of mine and, by using Regexp::Assemble, managed to get a 30% speed increase on real-world data.
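A minimal sketch of the precompilation advice above; the keyword list is hypothetical, standing in for the lines read from <RECORDS>:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the contents of the records file.
my @keywords = ('this', 'that', 'here');

# Build the alternation and compile it ONCE, outside the matching loop.
my $pattern  = join '|', map quotemeta, @keywords;
my $compiled = qr/$pattern/;

my @hits;
for my $line ('nothing to see', 'look over here', 'that one') {
    # No per-line recompilation: $compiled is already a Regexp object.
    push @hits, $line if $line =~ $compiled;
}
print scalar(@hits), " lines matched\n";   # prints "2 lines matched"
```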
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
As the original post stated that he wanted to look for certain strings, not patterns, I think that
chomp(my @RE = <RECORDS>);
my $RE = join '|', map quotemeta, @RE;
$RE = qr/($RE)/;
will possibly be the most efficient matcher (at least on 5.10 ...), no?
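A sketch of that suggestion in action, with a hypothetical keyword list in place of <RECORDS>; the quotemeta matters when keywords contain regex metacharacters, and the capturing group reports which keyword hit:

```perl
use strict;
use warnings;

# Hypothetical keywords; note the '.' that must match literally.
my @RE = ('a.b', 'foo');
my $RE = join '|', map quotemeta, @RE;
$RE = qr/($RE)/;

if ('find a.b here' =~ $RE) {
    print "matched literal: $1\n";   # prints "matched literal: a.b"
}
# Without quotemeta, 'a.b' would also match 'aXb'; quoted, it does not.
print 'find aXb here' =~ $RE ? "dot matched\n" : "dot did not match\n";
```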
[]s, HTH, Massa (κς,πμ,πλ)
                  Rate       Raw Regex Assembled Regex
Raw Regex        614/s              --             -9%
Assembled Regex  676/s             10%              --
The text to check was part of a play by Shakespeare, loaded into an array (one line per element), and the regex used was qr/this|that|here|there|where/;
Assembled into a more efficient regex it looks like:
qr/(?:th(?:ere|at|is)|w?here)/
Even by eliminating only ONE of the alternations (5 vs. 4), a 10% speed-up was obtained.
PS: I do not know about 5.10; this was tested on 5.8.
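A rough sketch of how a comparison like the one above could be reproduced with the core Benchmark module; the sample lines here stand in for the Shakespeare text, so the exact percentages will differ:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample text standing in for the play.
my @lines = ('to be or not to be', 'get thee hence', 'what light is there') x 500;

my $raw       = qr/this|that|here|there|where/;
my $assembled = qr/(?:th(?:ere|at|is)|w?here)/;   # fewer alternations

# Run each matcher for about one CPU second and compare rates.
cmpthese(-1, {
    'Raw Regex'       => sub { my $n = grep { /$raw/ } @lines },
    'Assembled Regex' => sub { my $n = grep { /$assembled/ } @lines },
});
```

Both patterns accept exactly the same strings; only the matching speed differs.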
Re: Not sure what I am doing wrong
by DStaal (Chaplain) on Nov 24, 2008 at 14:57 UTC
What, exactly, do you think is going wrong? Yes, this looks like it is reading through the files and checking each line of the files against each of your keys, which seems to be both what you are trying to do and what you are worried about.
I can see easier, and maybe faster, ways to code some of the above (check out IO::Uncompress::Gunzip), but I'm not sure what you think the problem is.
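A sketch of the IO::Uncompress::Gunzip suggestion above: reading a gzipped log line by line without shelling out to gunzip. A tiny in-memory "log" stands in for a real compressed file here; with a real file you would pass the filename instead of the scalar ref:

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Build a tiny gzipped "log" in memory so the sketch is self-contained.
my $data = "one match here\nno hit\nanother match\n";
gzip \$data => \my $gzipped or die "gzip failed: $GzipError";

# For a real log you would write: IO::Uncompress::Gunzip->new($filename)
my $z = IO::Uncompress::Gunzip->new(\$gzipped)
    or die "gunzip failed: $GunzipError";

my $hits = 0;
while (my $line = <$z>) {       # reads decompressed data line by line
    $hits++ if $line =~ /match/;
}
close $z;
print "$hits matching lines\n";   # prints "2 matching lines"
```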
Thanks for the reply. I guess it's not so much a problem as that I think there may be a better way to search through the logs for each keyword (or I'm hoping there is). The problem is that each log file is about 10-18 MB compressed, so searching for each of the 100 keywords against every single line in a log file takes a really long time: about 40-50 minutes per log file. Is there a way to search faster?
40-50 minutes per log file? Yeah, you should be able to speed that up... (Dependent on hardware, of course, but my Mail::Log::Parse modules go through 8-9 MB (after compression) logfiles in around a minute per (compressed) megabyte, and they do a lot more work than this.)
The other comments on combining patterns could be useful, but the other thing to look at is what exactly your patterns are. Small differences in patterns can make some fairly major differences in speed. (Of course, if you are doing literals it's probably not that bad.)
Re: Not sure what I am doing wrong
by massa (Hermit) on Nov 24, 2008 at 14:59 UTC
sub queries() {
    if ( -f and /^ftr001/ ) {
        print "----------------------------------------------\n";
        print "Searching File:", $_, "\n";
        print "----------------------------------------------\n";
        open(UNZIP, "gunzip -dc < $_ |") or die "gunzip: $!";
        open(RECORDS, "/tmp/logs/records.txt")
            or die "Could not open records file: $!";
        my @REC = <RECORDS>;
        chomp @REC;
        my $RE = join '|', map { quotemeta $_ } @REC;
        $RE = qr/($RE)/;
        while ( <UNZIP> ) {
            print "Searching for: $1\nFound it in: $_\n" if /$RE/;
        }
    }
}
[]s, HTH, Massa (κς,πμ,πλ)
Re: Not sure what I am doing wrong
by Bloodnok (Vicar) on Nov 24, 2008 at 15:02 UTC
The first thing you should do is to get the close() calls in the right context...
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my $DIR = "/tmp/logs/";
find(\&queries, $DIR);

sub queries() {
    if ( -f and /^ftr001/ ) {
        print "----------------------------------------------\n";
        print "Searching File:", $_, "\n";
        print "----------------------------------------------\n";
        open(UNZIP, "gunzip -dc < $_ |") or die "gunzip: $!";
        open(RECORDS, "/tmp/logs/records.txt")
            or die "Could not open records file: $!";
        my @REC = <RECORDS>;
        chomp @REC;
        close RECORDS;
        while ( <UNZIP> ) {
            my $EACH_LINE = $_;
            foreach my $REC_QUERY (@REC) {
                print "Searching for current record: $REC_QUERY\n";
                if ( $EACH_LINE =~ /\($REC_QUERY\)/ ) {
                    print "Querying for record: $REC_QUERY\n";
                    print "$EACH_LINE\n";
                }
            }
        }
        close UNZIP;
    }
}
...in that way you'll be closing the files at the right time, i.e. after you've finished with them. Otherwise you'll encounter problems...
After that, if I've understood you correctly, the simplest approach would be to convert the sought-after records into an RE and use that against each file. Moreover, there's a marked performance hit in opening the records file once for every directory, compared with opening it once at the outset - something along the lines of the following (untested) snippet...
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my $DIR = "/tmp/logs/";
find(\&queries, $DIR);

open(RECORDS, "/tmp/logs/records.txt")
    or die "Could not open records file: $!";
local $/ = undef; # Ignoring binmode as it's a text file
my $RE = join '|', <RECORDS>;
close RECORDS;

sub queries() {
    return unless -f and /^ftr001/;
    print "----------------------------------------------\n";
    print "Searching File:", $_, "\n";
    print "----------------------------------------------\n";
    open(UNZIP, "gunzip -dc < $_ |") or die "gunzip: $!";
    while ( <UNZIP> ) {
        next unless /$RE/;
        print $_;
    }
    close UNZIP;
}
A user level that continues to overstate my experience :-))
Thanks for the quick reply. I used the code above, which looks to be correct, but I get the following error messages. Not 100% sure why these errors are coming up.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11802.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11803.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11804.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11805.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11806.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11807.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11808.
Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11809.
Did you mean to put the following outside the sub?
open(RECORDS, "/tmp/logs/records.txt")
    or die "Could not open records file: $!";
local $/ = undef; # Ignoring binmode as it's a text file
my $RE = join '|', <RECORDS>;
close RECORDS;
I tried moving the stuff above inside the sub, printed the $RE var, and got the correct records, but got no results. Not sure what the problem is. All in all, what I am trying to figure out is whether there is a faster way to search for each of the 100 key entries against all the log files. The current problem (if that's what you want to call it) is that it takes almost 1 hour to search for the 100 keywords against each of the log files, which are about 15-18 MB in size.
Thanks again for all the HELP!!
use File::Find;

my $DIR = "/tmp/logs/";
open(RECORDS, "/tmp/logs/records.txt")
    or die "Could not open records file: $!";
local $/ = undef; # Ignoring binmode as it's a text file
my $RE = join '|', <RECORDS>;
close RECORDS;
find(\&queries, $DIR);
Oops, sorry, my fault - as shown, the $RE lexical variable wasn't defined until after use (by the queries sub).
Re: Not sure what I am doing wrong
by f00li5h (Chaplain) on Nov 24, 2008 at 15:02 UTC
$regex = join '|', map quotemeta, @REC;
print "I want $EACH_LINE" if $EACH_LINE =~ $regex;
It will get you a pattern made up of all the lines in the file, joined by |, which will match any of them.
It may need anchoring if the lines are meant to match the full line instead.
Also, the quotemeta presumes you just want literal strings (remove it if @REC contains regexen)
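A quick sketch of the anchoring point above, with a hypothetical keyword list: unanchored, the pattern matches anywhere in the line; anchored, the entire line must be one of the keywords:

```perl
use strict;
use warnings;

# Hypothetical keywords standing in for @REC.
my @REC   = ('error', 'warn');
my $regex = join '|', map quotemeta, @REC;

my $sub   = qr/$regex/;          # matches anywhere in the line
my $whole = qr/^(?:$regex)$/;    # the whole line must be a keyword

print 'warning' =~ $sub   ? "substring: hit\n" : "substring: miss\n"; # hit ('warn' is inside)
print 'warning' =~ $whole ? "anchored: hit\n"  : "anchored: miss\n";  # miss (not a full keyword)
```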
@_=qw; ask f00li5h to appear and remain for a moment of pretend better than a lifetime;;s;;@_[map hex,split'',B204316D8C2A4516DE];;y/05/os/&print;