
Let me guess...you tested this on 5.005, not 5.6?

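I was so flabbergasted by those results that I just had to test it myself, and it turns out that there is a huge discrepancy between the same code running on 5.005_3 and on 5.6: the two per-pattern regex approaches (CODE_REGEX and MANY_REGEXES) use roughly six times as much CPU time under 5.005_3.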

Here are my results. tye's neat coderef thing wasn't returning valid output: $1 was always undef in the calling loop, since the $1 we wanted was localized to a different block (the one inside the generated sub). I modified it slightly so that the sub returns the match itself. Also, all of the techniques tested assume only one match per line.

#!/usr/bin/perl -w
use strict;
use Benchmark;

use vars qw(@problem);
@problem = (
            "0  OBS",
            "AT LEAST",
            "EXTRANEOUS",
            "CARTESIAN",
            "CLOSING",
            "CONVERT",
            "DIVISION BY ZERO",
            "DOES NOT EXIST",
            "DUE TO LOOPING",
            "END OF MACRO",
            "ENDING EXECUTION",
            "ERROR",
            "ERRORABEND",
            "ERRORCHECK=STRICT",
            "EXCEED",
            "HANGING",
            "HAS 0 OBSERVATIONS",
            "ILLEGAL",
            "INCOMPLETE",
            "INVALID",
            "LOST CARD",
            "MATHEMAT",
            "MERGE STATEMENT",
            "MISSING",
            "MULTIPLE",
            "NOT FOUND",
            "NOT RESOLVED",
            "OBS=0",
            "REFERENCE",
            "REPEAT",
            "SAS CAMPUS DRIVE",
            "SAS SET OPTION OBS=0",
            "SAS WENT",
            "SHIFTED",
            "STOP",
            "TOO SMALL",
            "UNBALANCED",
            "UNCLOSED",
            "UNREF",
            "UNRESOLVED",
            "WARNING"
           );

open FOO, ">/dev/null" or die $!;

timethese (5, {NO_REGEX => \&no_regex,
                CODE_REGEX => \&code_regex,
                BIG_REGEX => \&big_regex,
                MANY_REGEXES => \&many_regexes
               });

sub no_regex {
    local @ARGV = @ARGV;

    while(<>) {
        my $up= uc $_;
        foreach my $p (  @problem  ) {
            if(  0 <= index($up,$p)  ) {
                print FOO "line $.: problem: $p\n$_\n";
                last;
            }
        }
    }
}

sub big_regex {
    local @ARGV = @ARGV;
    my $match = ret_match_any(@problem);

    while(<>) {
        if ($_ =~ $match) {
            print FOO "line $.: problem: $1\n$_\n";
        }
    }
}

sub ret_match_any {
    # same as tilly's original
}

sub trie_strs {
    # same as tilly's original
}

sub many_regexes {
    local @ARGV = @ARGV;
    local @problem = map {qr/(\Q$_\E)/i} @problem;

    while (<>) {
        for my $p (@problem) {
            print FOO "line $.: problem: $1\n$_\n" and last if /$p/;
        }
    }
}

sub code_regex {
    local @ARGV = @ARGV;

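    # Build the source for a sub that tries each quoted pattern in turn
    # and returns the captured text:
    #   sub { /(PAT1)/i || /(PAT2)/i || ... and return $1}
    my $code= "sub { /("
      . join ")/i || /(", map {"\Q$_\E"} @problem;
    $code .= ')/i and return $1}';

    # Compile the generated source; bail out if it is not a code ref.
    my $match= eval $code;
    die "$@"   unless  ref($match)
      &&  UNIVERSAL::isa($match,"CODE");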

    while(<>) {
        if( my $p = &$match() ) {
            print FOO "line $.: problem: $p\n$_\n";
        }
    }
}

__END__

5.6:
chh@scallop test> perl matchtest sample.txt
Benchmark: timing 5 iterations of BIG_REGEX, CODE_REGEX, MANY_REGEXES, NO_REGEX...
 BIG_REGEX: 90 wallclock secs (89.58 usr +  0.28 sys = 89.86 CPU) @  0.06/s (n=5)
CODE_REGEX: 53 wallclock secs (53.13 usr +  0.36 sys = 53.49 CPU) @  0.09/s (n=5)
MANY_REGEXES: 60 wallclock secs (59.27 usr +  0.28 sys = 59.55 CPU) @  0.08/s (n=5)
  NO_REGEX: 44 wallclock secs (43.33 usr +  0.30 sys = 43.63 CPU) @  0.11/s (n=5)

5.005_3:
Benchmark: timing 5 iterations of BIG_REGEX, CODE_REGEX, MANY_REGEXES, NO_REGEX...
 BIG_REGEX: 79 wallclock secs (77.08 usr +  0.61 sys = 77.69 CPU)
CODE_REGEX: 357 wallclock secs (354.97 usr +  0.64 sys = 355.61 CPU)
MANY_REGEXES: 363 wallclock secs (361.99 usr +  0.73 sys = 362.72 CPU)
  NO_REGEX: 43 wallclock secs (42.84 usr +  0.19 sys = 43.03 CPU)


The 10MB test file was generated thusly:

my @chars = map chr($_), 32..127;
open SAMPLE, ">sample.txt" or die $!;
while (-s SAMPLE < 1024*1024*10) {
    my $line = join '', map { $chars[rand @chars] } 1..100;
    substr $line, rand(100), 0, $problem[rand @problem];
    print SAMPLE "$line\n";
}
close SAMPLE;

Update: I just noticed that the stub above was using a different name for the @problem array, which makes it look as if I was generating all non-matching lines when I was really generating exactly one match per line.
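
For anyone puzzled by the $1 remark near the top, here is a minimal sketch of the scoping behaviour involved. It is not tye's code; the sub names match_only and match_and_return and the sample line are made up for illustration. The capture variables are scoped to the enclosing block, so a capture made inside a sub is gone once the sub returns unless the sub hands it back explicitly, which is the change made in code_regex above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reports success only. The capture group is set
# inside this sub's scope and is not visible to the caller afterwards.
sub match_only {
    my ($line) = @_;
    return $line =~ /(ERROR|WARNING)/i ? 1 : 0;
}

# Hypothetical helper: copies the capture out before the scope ends.
sub match_and_return {
    my ($line) = @_;
    return $line =~ /(ERROR|WARNING)/i ? $1 : undef;
}

# Made-up sample line, not taken from a real SAS log.
my $line = "NOTE: 3 warning messages were suppressed";

if (match_only($line)) {
    # No match has succeeded in this scope, so $1 should be undef here.
    my $seen = defined $1 ? $1 : 'undef';
    print "after match_only: \$1 is $seen\n";
}

my $hit = match_and_return($line);
print "match_and_return gave: $hit\n" if defined $hit;
```

Run as written, the first print should report undef and the second should report the matched text, which is the same symptom and the same fix as in the benchmark.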

In reply to RE: RE (tilly) 4: SAS log scanner by takshaka
in thread SAS log scanner by nop
