learningperl01 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, wondering if someone could help me out. I have the following script that I am using to search for about 100 records listed in a file against a bunch of log files. The problem is that I think it's searching each line for all 100 records. Not sure what I am doing wrong here; hopefully someone can point me in the right direction. Once again, here is what I am trying to do: I have a directory full of log files, and I want to search through all of those files for entries matching the keywords in another file. Thanks, and please let me know if you have any questions. Thanks again for the help!!
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my $DIR = "/tmp/logs/";

find(\&queries, $DIR);

sub queries() {
    if ( -f and /^ftr001/ ) {
        print "----------------------------------------------\n";
        print "Searching File:", $_, "\n";
        print "----------------------------------------------\n";
        open(UNZIP, "gunzip -dc < $_ |") or die "gunzip: $!";
        open(RECORDS, "/tmp/logs/records.txt")
            or die "Could not open records file: $!";
        my @REC = <RECORDS>;
        chomp @REC;
        while ( <UNZIP> ) {
            my $EACH_LINE = $_;
            foreach my $REC_QUERY (@REC) {
                print "Searching for current record: $REC_QUERY\n";
                if ( $EACH_LINE =~ /\($REC_QUERY\)/ ) {
                    print "Querying for record: $REC_QUERY\n";
                    print "$EACH_LINE\n";
                }
            }
        }
    }
}
close(UNZIP);
close(RECORDS);

Replies are listed 'Best First'.
Re: Not sure what I am doing wrong
by CountZero (Bishop) on Nov 24, 2008 at 16:50 UTC
    Contrary to what my fellow Monks are telling you, it is not always a good idea to join your single regex patterns with | and then use this combined regex on all your tests. Alternations in a regex are processed in O(n) time, so increasing the number of alternations has a linear effect on your processing time.

    To start with, if you do

    my $RE = join '|', <RECORDS>;
    and then later
    /$RE/;
    for every test, Perl will have to re-compile your pattern every time. It is much better to pre-compile the pattern by doing
    my $RE = join '|', <RECORDS>;
    my $compiled_RE = qr/$RE/;
    and then later
    /$compiled_RE/;
    In that case you only pay the compilation cost once.

    Even better is to use a module such as Regexp::Assemble, which can automatically build a more efficient regex for you. Cutting down on the number of alternations will lead to a faster regex. I tried this in a program of mine and, by using Regexp::Assemble, managed to get a 30% speed increase when using real-world data.
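    A minimal sketch of the Regexp::Assemble approach (assuming the module is installed from CPAN; the word list here is just an illustration, not the OP's records):

    ```perl
    use strict;
    use warnings;
    use Regexp::Assemble;

    # Feed each literal key into the assembler
    my $ra = Regexp::Assemble->new;
    $ra->add($_) for qw(this that here there where);

    # ->re returns a compiled pattern with common prefixes factored out,
    # e.g. something like (?:th(?:ere|at|is)|w?here)
    my $re = $ra->re;

    print "matched\n" if 'neither here nor there' =~ /$re/;
    ```

    The assembler pays its compilation cost once, up front, which is exactly what you want when the same pattern is run against every line of every log file.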

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      As his original post had stated that he wanted to look for certain strings, not patterns, I think that the
      chomp(my @RE = <RECORDS>);
      my $RE = join '|', map quotemeta, @RE;
      $RE = qr/($RE)/;
      will possibly be the most efficient matcher (at least on 5.10 ...), no?
      []s, HTH, Massa (κς,πμ,πλ)
        Even then, the fewer alternations you have, the faster your regex will run. Of course, if the strings or patterns have nothing in common it makes no difference.

        I did a quick test and got the following results:

                            Rate       Raw Regex Assembled Regex
        Raw Regex          614/s              --             -9%
        Assembled Regex    676/s             10%              --
        The text to check was part of a play by Shakespeare, loaded into an array (1 line per element) and the regex used was
        qr/this|that|here|there|where/;
        Assembled into a more efficient regex it looks like:
        qr/(?:th(?:ere|at|is)|w?here)/
        Even by eliminating only ONE of the alternations (5 vs 4) a 10% speed-up was obtained.

        PS: I do not know about 5.10, this was tested in 5.8.
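        A comparison along these lines can be reproduced with the core Benchmark module. This is only a sketch: the test text and iteration counts below are placeholders, not the Shakespeare data or the figures from the table above.

        ```perl
        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # Placeholder corpus; CountZero used part of a Shakespeare play
        my @lines = ('To be, or not to be', 'that is the question') x 500;

        my $raw       = qr/this|that|here|there|where/;
        my $assembled = qr/(?:th(?:ere|at|is)|w?here)/;

        # Run each sub for ~2 CPU seconds and print a comparison table
        cmpthese(-2, {
            'Raw Regex'       => sub { my $n = grep { /$raw/ } @lines },
            'Assembled Regex' => sub { my $n = grep { /$assembled/ } @lines },
        });
        ```

        The relative speed-up will vary with the corpus and the Perl version, which is why results on 5.8 and 5.10 may differ.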

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Not sure what I am doing wrong
by DStaal (Chaplain) on Nov 24, 2008 at 14:57 UTC

    What, exactly, do you think is going wrong? Yes, this looks like it is reading through the files and checking each line of the files against each of your keys. Which seems to be both what you are trying to do and what you are worried about.

    I can see easier, and maybe faster ways to code some of the above (check out IO::Uncompress::Gunzip), but I'm not sure what you think the problem is.
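    For reference, reading a compressed log with the core IO::Uncompress::Gunzip module instead of piping through an external gunzip might look like this (a sketch; the filename and pattern are made up):

    ```perl
    use strict;
    use warnings;
    use IO::Uncompress::Gunzip qw($GunzipError);

    my $file = 'ftr001.log.gz';    # hypothetical compressed log file

    # Open the .gz file directly, no external gunzip process needed
    my $z = IO::Uncompress::Gunzip->new($file)
        or die "gunzip failed: $GunzipError\n";

    while ( my $line = $z->getline ) {
        print $line if $line =~ /some_record/;    # placeholder key
    }
    $z->close;
    ```

    This avoids the fork/exec overhead of shelling out per file and gives you proper error reporting via $GunzipError.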

      Thanks for the reply. I guess it's not so much a problem as that I think there may be a better way to search through the logs for each keyword (or I'm hoping there is). The problem is that each log file is about 10-18 MB in size compressed, so searching for each of the 100 keywords against every single line in the log file takes a really long time: about 40-50 minutes per log file. Not sure if there is a way to search faster?

        40-50 minutes per log file? Yeah, you should be able to speed that up... (Dependent on hardware of course, but my Mail::Log::Parse modules go through 8-9MB (after compression) logfiles in around a minute per (compressed) megabyte, and they do a lot more work than this.)

        The other comments on combining patterns could be useful, but the other thing to look at is what exactly your patterns are. Small differences in patterns can make some fairly major differences in speed. (Of course, if you are doing literals it's probably not that bad.)

Re: Not sure what I am doing wrong
by massa (Hermit) on Nov 24, 2008 at 14:59 UTC
    sub queries() {
        if ( -f and /^ftr001/ ) {
            print "----------------------------------------------\n";
            print "Searching File:", $_, "\n";
            print "----------------------------------------------\n";
            open(UNZIP, "gunzip -dc < $_ |") or die "gunzip: $!";
            open(RECORDS, "/tmp/logs/records.txt")
                or die "Could not open records file: $!";
            my @REC = <RECORDS>;
            chomp @REC;
            my $RE = join '|', map { quotemeta $_ } @REC;
            $RE = qr/($RE)/;
            while ( <UNZIP> ) {
                print "Searching for: $1\nFound it in: $_\n" if /$RE/;
            }
        }
    }
    []s, HTH, Massa (κς,πμ,πλ)
Re: Not sure what I am doing wrong
by Bloodnok (Vicar) on Nov 24, 2008 at 15:02 UTC
    The first thing you should do is to get the close() calls in the right context...
    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;

    my $DIR = "/tmp/logs/";

    find(\&queries, $DIR);

    sub queries() {
        if ( -f and /^ftr001/ ) {
            print "----------------------------------------------\n";
            print "Searching File:", $_, "\n";
            print "----------------------------------------------\n";
            open(UNZIP, "gunzip -dc < $_ |") or die "gunzip: $!";
            open(RECORDS, "/tmp/logs/records.txt")
                or die "Could not open records file: $!";
            my @REC = <RECORDS>;
            chomp @REC;
            close RECORDS;
            while ( <UNZIP> ) {
                my $EACH_LINE = $_;
                foreach my $REC_QUERY (@REC) {
                    print "Searching for current record: $REC_QUERY\n";
                    if ( $EACH_LINE =~ /\($REC_QUERY\)/ ) {
                        print "Querying for record: $REC_QUERY\n";
                        print "$EACH_LINE\n";
                    }
                }
            }
            close UNZIP;
        }
    }
    ...in that way you'll be closing the files at the right time i.e. after you've finished with them. Otherwise you'll encounter problems...

    After that, if I've understood you correctly, the simplest approach would be to convert the sought-after records into an RE and use that against each file. Moreover, there's a marked performance hit from opening the records file once for every matching file - compared with once at the outset - something along the lines of the following (untested) snippet...

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;

    my $DIR = "/tmp/logs/";

    find(\&queries, $DIR);

    open(RECORDS, "/tmp/logs/records.txt")
        or die "Could not open records file: $!";
    local $/ = undef;    # Ignoring binmode as it's a text file
    my $RE = join '|', <RECORDS>;
    close RECORDS;

    sub queries() {
        return unless -f and /^ftr001/;
        print "----------------------------------------------\n";
        print "Searching File:", $_, "\n";
        print "----------------------------------------------\n";
        open(UNZIP, "gunzip -dc < $_ |") or die "gunzip: $!";
        while ( <UNZIP> ) {
            next unless /$RE/;
            print $_;
        }
        close UNZIP;
    }

    A user level that continues to overstate my experience :-))
      Thanks for the quick reply. I used the code above, which looks to be correct, and I get the following error messages. Not 100% sure why that error is coming up.
      Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11802.
      Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11803.
      Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11804.
      Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11805.
      Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11806.
      Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11807.
      Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11808.
      Use of uninitialized value in regexp compilation at dns.pl line 23, <UNZIP> line 11809.
      Did you mean to put the following outside the sub?
      open(RECORDS, "/tmp/logs/records.txt")
          or die "Could not open records file: $!";
      local $/ = undef;    # Ignoring binmode as it's a text file
      my $RE = join '|', <RECORDS>;
      close RECORDS;
      I tried moving the stuff above inside the sub and printed the $RE var and got the correct records, but got no results. Not sure what the problem is. All in all, what I am trying to figure out is whether there is a faster way to search for each of the 100 key entries in my file against all the log files. The current problem (if that's what you want to call it) is that it takes almost 1 hour to search for 100 keywords against each of the log files, which are about 15-18 MB in size. Thanks again for all the HELP!!
        Yep, moving that block saves an awful lot of time - as I said, with the block moved, you open records.txt once only - compared with once per file.

        As far as the uninit value is concerned, try...

        use File::Find;

        my $DIR = "/tmp/logs/";

        open(RECORDS, "/tmp/logs/records.txt")
            or die "Could not open records file: $!";
        local $/ = undef;    # Ignoring binmode as it's a text file
        my $RE = join '|', <RECORDS>;
        close RECORDS;

        find(\&queries, $DIR);
        Oops, sorry, my fault - as shown, the $RE lexical variable wasn't defined until after it was used (by the queries sub).

        A user level that continues to overstate my experience :-))
Re: Not sure what I am doing wrong
by f00li5h (Chaplain) on Nov 24, 2008 at 15:02 UTC

    Update: you can skip this node if you've read Re: Not sure what I am doing wrong by massa

    Depending on the data, and what you're searching for, you can make one pattern by using

    $regex = join '|', map quotemeta, @REC;
    print "I want $EACH_LINE" if $EACH_LINE =~ $regex;

    It will get you a pattern made up of all the lines in the file, joined by |, which will match any of them.

    It may need anchoring if the lines are to match the full line instead.

    Also, the quotemeta presumes you just want literal strings (remove it if @REC contains regexen).
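    To illustrate both points, here is a small sketch (the keys are made up; they are not from the OP's records file):

    ```perl
    use strict;
    use warnings;

    my @REC = ('foo.bar', 'baz(1)');

    # quotemeta escapes regex metacharacters, so '.' and '(' match literally
    my $regex = join '|', map quotemeta, @REC;

    # anchor the alternation if a key must match the whole line,
    # rather than appearing anywhere in it
    my $anchored = qr/^(?:$regex)$/;

    print "substring match\n"  if 'x foo.bar y' =~ /$regex/;
    print "whole-line match\n" if 'baz(1)'      =~ $anchored;
    ```

    Without quotemeta, the '.' in foo.bar would match any character and the parentheses in baz(1) would be a malformed capture group.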

    @_=qw; ask f00li5h to appear and remain for a moment of pretend better than a lifetime;;s;;@_[map hex,split'',B204316D8C2A4516DE];;y/05/os/&print;