in reply to Re: Pattern match array
in thread Pattern match array

You probably want to use word boundary anchors so that you don't get false positives with fprintf when looking for int etc.

I agree. In other words, break the text into words before attempting to match entire words. So all the regex engine is doing is replicating an inner for-loop with an eq test.

while ( <> ) {
    for my $word ( m{\b(\w+)\b}g ) {
        if (grep {$word eq $_} @prims) {
            print qq{Found $word on line $.\n};
        }
    }
}

Update: which, of course, vindicates thezip's hash-based solution in the first response to this question.
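
For anyone who wants to see the false positive that the word boundary anchors prevent, here is a minimal sketch. The sample line and the variable names are made up for illustration; they are not taken from the original question.

```perl
use strict;
use warnings;

# Hypothetical line of C source, invented for this illustration.
my $line = q{fprintf(fp, "x"); int total = 0;};

# Without anchors, "int" is also found inside "fprintf".
my @loose  = $line =~ m{(int)}g;

# With \b on both sides only the standalone keyword matches.
my @strict = $line =~ m{\b(int)\b}g;

print scalar(@loose), qq{ unanchored, }, scalar(@strict), qq{ anchored\n};
```

The unanchored pattern reports two hits on this line and the anchored one reports one.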

Re^3: Pattern match array
by johngg (Canon) on May 08, 2008 at 13:52 UTC
    Letting the regex alternation do the heavy lifting seems to be a bit faster. Tested with a 1235-line C program cat'ed together 20 times.

    use strict;
    use warnings;

    use Benchmark q{cmpthese};

    my @prims = qw{ int char long double static };

    my $inFile = q{xxx.c};
    open my $inFH, q{<}, $inFile
        or die qq{open: $inFile: $!\n};
    my $outFile = q{/dev/null};
    open my $outFH, q{>}, $outFile
        or die qq{open: $outFile: $!\n};

    cmpthese(
        -10,
        {
            JohnGG => sub {
                seek $inFH, 0, 0;
                my $rxPrims = do {
                    local $" = q{|};
                    qr{\b(@prims)\b};
                };
                while ( <$inFH> ) {
                    next unless my @found = m{$rxPrims}g;
                    print $outFH qq{Found @found on line $.\n};
                }
            },
            Narveson => sub {
                seek $inFH, 0, 0;
                while ( <$inFH> ) {
                    for my $word ( m{\b(\w+)\b}g ) {
                        if (grep {$word eq $_} @prims) {
                            print $outFH qq{Found $word on line $.\n};
                        }
                    }
                }
            },
        }
    );

    close $inFH
        or die qq{close: $inFile: $!\n};
    close $outFH
        or die qq{close: $outFile: $!\n};

    The benchmark output.

               Rate Narveson   JohnGG
    Narveson 1.39/s       --     -63%
    JohnGG   3.78/s     173%       --
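
    For readers less familiar with the list separator trick, the same alternation can probably be built with join. This is only a sketch: the quotemeta call and the sample line are additions for illustration and are not part of the benchmarked code.

    ```perl
    use strict;
    use warnings;

    my @prims = qw{ int char long double static };

    # join is an alternative to localising $"; quotemeta only matters
    # if a keyword ever contains regex metacharacters (an assumption,
    # none of the keywords here do).
    my $alternation = join q{|}, map { quotemeta($_) } @prims;
    my $rxPrims     = qr{\b($alternation)\b};

    # Hypothetical line of C source, invented for this illustration.
    my $line  = q{static long int counter; fprintf(fp, "x");};
    my @found = $line =~ m{$rxPrims}g;

    print qq{Found @found\n};
    ```

    On that sample line the match should pick up static, long and int, and skip counter and fprintf.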

    I hope this is of interest.

    Cheers,

    JohnGG

      Thanks, this is of interest.

      I think if I had known benchmarks would be run, I would have hashed instead of grepping.

      my %sought = map {$_ => 1} @prims;

      and later

      if ($sought{$word})

      And then there's List::MoreUtils::any, which would at least quit on the first match instead of checking the rest of the list.
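
      Roughly what I have in mind for both variants, as an untested-in-anger sketch. Note the assumption: it imports any from List::Util (version 1.33 or later) rather than List::MoreUtils, purely so that it runs on a stock modern Perl. The sample line and the @byHash and @byAny names are made up for illustration.

      ```perl
      use strict;
      use warnings;
      use List::Util qw{any};    # assumption: List::Util 1.33 or later

      my @prims  = qw{ int char long double static };
      my %sought = map { $_ => 1 } @prims;

      # Hypothetical line of C source, invented for this illustration.
      my $line = q{static int total; fprintf(fp, "x");};

      my ( @byHash, @byAny );
      for my $word ( $line =~ m{\b(\w+)\b}g ) {
          # One hash probe per word, whatever the size of the keyword list.
          push @byHash, $word if $sought{$word};

          # any returns as soon as one keyword equals $word.
          push @byAny, $word if any { $word eq $_ } @prims;
      }

      print qq{hash: @byHash\nany: @byAny\n};
      ```

      Both lists should come out the same; the difference is only in how much work is done per word.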

        Using a hash instead of grepping does improve performance but the regex alternation still seems to retain the advantage.

                    Rate Narveson Narveson2   JohnGG
        Narveson  1.39/s       --      -30%     -64%
        Narveson2 1.99/s      43%        --     -48%
        JohnGG    3.81/s     174%       91%       --

        Cheers,

        JohnGG