RE: RE (tilly) 1: SAS log scanner

Doing a great big regex with lots of /this|that|the other|etc/ is documented as probably being slower than /this/ || /that/ || /the other/ || /etc/ (or at least was documented -- this may have been dropped when perlre.pod was created, perhaps because it is no longer true). Of course, this will vary by case so a benchmark will tell you for sure.

So, based on my old bias, I'd code it one of these two ways:

@problem = (
"0  OBS",
"AT LEAST",
"EXTRANEOUS",
"CARTESIAN",
"CLOSING",
"CONVERT",
"DIVISION BY ZERO",
"DOES NOT EXIST",
"DUE TO LOOPING",
"END OF MACRO",
"ENDING EXECUTION",
"ERROR",
"ERRORABEND",
"ERRORCHECK=STRICT",
"EXCEED",
"HANGING",
"HAS 0 OBSERVATIONS",
"ILLEGAL",
"INCOMPLETE",
"INVALID",
"LOST CARD",
"MATHEMAT",
"MERGE STATEMENT",
"MISSING",
"MULTIPLE",
"NOT FOUND",
"NOT RESOLVED",
"OBS=0",
"REFERENCE",
"REPEAT",
"SAS CAMPUS DRIVE",
"SAS SET OPTION OBS=0",
"SAS WENT",
"SHIFTED",
"STOP",
"TOO SMALL",
"UNBALANCED",
"UNCLOSED",
"UNINITIALIZED",
"UNREF",
"UNRESOLVED",
"WARNING"
);


# First way:
my $code= "sub { /("
  . join ")/i || /(", map {"\Q$_\E"} @problem;
$code .= ")/i }";
my $match= eval $code;
die "$@"   unless  ref($sub)
               &&  UNIVERSAL::isa($sub,"CODE");
while(<>) {
    if(  &$match()  ) {
        print "line $.: problem: $1\n$_\n";
    }
}


# Second way:
while(<>) {
    my $up= upcase $_;
    foreach my $p (  @problem  ) {
        if(  0 <= index($up,$p)  ) {
            print "line $.: problem: $p\n$_\n";
            last;
        }
    }
}
[download]

Both of my solutions differ from the original in that they only report one problem per line. My "second way" can be made like the original by simply removing the "last;" line. I see no simple way to make my "first way" like the original.

I apologize, but I didn't feel up to creating a test input file so I didn't test nor benchmark. I'd be interested in seeing test and benchmark results on real data.

- tye (but my friends call me "Tye")

Comment on RE: RE (tilly) 1: SAS log scanner Select or Download Code

Replies are listed 'Best First'.
RE (tilly) 3: SAS log scanner by tilly (Archbishop) on Sep 02, 2000 at 03:47 UTC
Gosh dang, but you are right. I just finished creating my sophisticated REs, saw your response, tried both of your versions. Your first one crashed, but your second ran just fine significantly faster than mine. Mine outran the original by a good factor as well. (I used the example code as a logfile to test. :-) I guess the constant string search really is a nice win. Thanks for that lesson! :-)	[reply]
RE: RE: RE (tilly) 1: SAS log scanner by takshaka (Friar) on Sep 03, 2000 at 00:15 UTC
I don't doubt that index() is faster than regexes; but it seems that, for regexes, the coderef approach would be slower than a qr// method: `@problem = map [$_, qr/\Q$_/i], @problem; while (<>) { for my $p (@problem) { print "line: $. problem: $p->[0]\n" if /$p->[1]/; } }` [download] But I'm also too lazy to generate a good input file for benchmarking.	[reply] [d/l]
RE (tilly) 4: SAS log scanner by tilly (Archbishop) on Sep 03, 2000 at 00:41 UTC
For an input problem, you could just use the input script again. That is what I did. Of course then you are really testing the time to compile code... In any case there is overhead to entering a block, and to a hash lookup. Therefore I believe that tye's approach really is faster than your loop. But I went and tested it and found that between the three approaches the difference was all in the compilation. So I went for a longer file (a bit over 4 MB) and found that I was a bit under 27 seconds, tye just over 35, and yours 2 min, 22 seconds. I think tye's is faster than yours, and mine was not so bad after all. :-) That said, perhaps I am not so ashamed to show the sophisticated method of building the RE, that coerces the RE engine into an optimization it should know about but does not (yet): use strict; my $match = &ret_match_any( "0 OBS", "AT LEAST", "EXTRANEOUS", "CARTESIAN", "CLOSING", "CONVERT", "DIVISION BY ZERO", "DOES NOT EXIST", "DUE TO LOOPING", "END OF MACRO", "ENDING EXECUTION", "ERROR", "ERRORABEND", "ERRORCHECK=STRICT", "EXCEED", "HANGING", "HAS 0 OBSERVATIONS", "ILLEGAL", "INCOMPLETE", "INVALID", "LOST CARD", "MATHEMAT", "MERGE STATEMENT", "MISSING", "MULTIPLE", "NOT FOUND", "NOT RESOLVED", "OBS=0", "REFERENCE", "REPEAT", "SAS CAMPUS DRIVE", "SAS SET OPTION OBS=0", "SAS WENT", "SHIFTED", "STOP", "TOO SMALL", "UNBALANCED", "UNCLOSED", "UNREF", "UNRESOLVED", "WARNING" ); while(<>) { if ($_ =~ $match) { print "line $., file $ARGV, problem $1\n$_\n"; } } # Takes a list of strings and returns an RE that matches any. sub ret_match_any { my $match_str = &trie_strs(map quotemeta, @_); return qr /($match_str)/; } # Takes a list of escaped strings and returns a single string # suitable for building a match in an efficient way. # Works recursively by grouping strings that share one character sub trie_strs { unless (@_) { return (); } my %rest; foreach my $str (@_) { if (length($str)) { my $chr = substr($str, 0, 1); if ("\\" eq $chr) { $chr = substr($str, 0, 2); push @{$rest{$chr}}, substr($str, 2); } else { push @{$rest{$chr}}, substr($str, 1); } } else { $rest{''} = ['']; } } my @to_join; foreach my $chr (keys %rest) { my $list_ref = $rest{$chr}; if (1 < @$list_ref) { push @to_join, $chr . &trie_strs(@$list_ref); } else { push @to_join, $chr . $list_ref->[0]; } } if (1 < @to_join) { return '(?:' . (join '\|', @to_join) . ')'; } else { return $to_join[0]; } } [download]	[reply] [d/l]
RE: RE (tilly) 4: SAS log scanner by takshaka (Friar) on Sep 03, 2000 at 10:45 UTC
Let me guess...you tested this on 5.005, not 5.6? I was so flabbergasted by those results that I just had to test myself, and it turns out that there is a huge discrepency between the same code running on 5.005_3 vs 5.6 Here are my results. tye's neat coderef thing wasn't returning valid output ($1 was always undef since the one we wanted was localized to a different block), so I modified that slightly to get correct results. Also, all of the techniques tested assume only one match per line. #!/usr/bin/perl -w use strict; use Benchmark; use vars qw(@problem); @problem = ( "0 OBS", "AT LEAST", "EXTRANEOUS", "CARTESIAN", "CLOSING", "CONVERT", "DIVISION BY ZERO", "DOES NOT EXIST", "DUE TO LOOPING", "END OF MACRO", "ENDING EXECUTION", "ERROR", "ERRORABEND", "ERRORCHECK=STRICT", "EXCEED", "HANGING", "HAS 0 OBSERVATIONS", "ILLEGAL", "INCOMPLETE", "INVALID", "LOST CARD", "MATHEMAT", "MERGE STATEMENT", "MISSING", "MULTIPLE", "NOT FOUND", "NOT RESOLVED", "OBS=0", "REFERENCE", "REPEAT", "SAS CAMPUS DRIVE", "SAS SET OPTION OBS=0", "SAS WENT", "SHIFTED", "STOP", "TOO SMALL", "UNBALANCED", "UNCLOSED", "UNREF", "UNRESOLVED", "WARNING" ); open FOO, ">/dev/null" or die $!; timethese (5, {NO_REGEX => \&no_regex, CODE_REGEX => \&code_regex, BIG_REGEX => \&big_regex, MANY_REGEXES => \&many_regexes }); sub no_regex { local @ARGV = @ARGV; while(<>) { my $up= uc $_; foreach my $p ( @problem ) { if( 0 <= index($up,$p) ) { print FOO "line $.: problem: $p\n$_\n"; last; } } } } sub big_regex { local @ARGV = @ARGV; my $match = ret_match_any(@problem); while(<>) { if ($_ =~ $match) { print FOO "line $.: problem: $1\n$_\n"; } } } sub ret_match_any { # same as tilly's original } sub trie_strs { # same as tilly's original } sub many_regexes { local @ARGV = @ARGV; local @problem = map {qr/(\Q$_\E)/i} @problem; while (<>) { for my $p (@problem) { print FOO "line $.: problem: $1\n$_\n" and last if /$p/; } } } sub code_regex { local @ARGV = @ARGV; my $code= "sub { /(" . join ")/i \|\| /(", map {"\Q$_\E"} @problem; $code .= ')/i and return $1}'; my $match= eval $code; die "$@" unless ref($match) && UNIVERSAL::isa($match,"CODE"); while(<>) { if( my $p = &$match() ) { print FOO "line $.: problem: $p\n$_\n"; } } } __END__ 5.6: chh@scallop test> perl matchtest sample.txt Benchmark: timing 5 iterations of BIG_REGEX, CODE_REGEX, MANY_REGEXES, + NO_REGEX... BIG_REGEX: 90 wallclock secs (89.58 usr + 0.28 sys = 89.86 CPU) @ 0 +.06/s (n=5) CODE_REGEX: 53 wallclock secs (53.13 usr + 0.36 sys = 53.49 CPU) @ 0 +.09/s (n=5) MANY_REGEXES: 60 wallclock secs (59.27 usr + 0.28 sys = 59.55 CPU) @ + 0.08/s (n=5) NO_REGEX: 44 wallclock secs (43.33 usr + 0.30 sys = 43.63 CPU) @ 0 +.11/s (n=5) 5.005_3: Benchmark: timing 5 iterations of BIG_REGEX, CODE_REGEX, MANY_REGEXES, + NO_REGEX... BIG_REGEX: 79 wallclock secs (77.08 usr + 0.61 sys = 77.69 CPU) CODE_REGEX: 357 wallclock secs (354.97 usr + 0.64 sys = 355.61 CPU) MANY_REGEXES: 363 wallclock secs (361.99 usr + 0.73 sys = 362.72 CPU) NO_REGEX: 43 wallclock secs (42.84 usr + 0.19 sys = 43.03 CPU) The 10MB test file was generated thusly: my @chars = map chr($_), 32..127; open SAMPLE, ">sample.txt" or die $!; while (-s SAMPLE < 1024102410) { my $line = join '', map { $chars[rand @chars] } 1..100; substr $line, rand(100), 0, $problem[rand @problem]; print SAMPLE "$line\n"; } close SAMPLE; [download] *Update:* I just noticed that the stub above was using a different name for the @problem array, which makes it look as if I was generating all non-matching lines when I was really generating exactly one match per line.	[reply] [d/l]
RE: RE: RE (tilly) 4: SAS log scanner by tye (Sage) on Sep 04, 2000 at 00:01 UTC
RE (tilly) 6: SAS log scanner by tilly (Archbishop) on Sep 03, 2000 at 18:08 UTC