PerlMonks  

Regex and question of design

by amaguk (Sexton)
on Apr 14, 2005 at 09:21 UTC ( [id://447672]=perlquestion )

amaguk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I have quite a large file, and I want to apply a set of regexes to each line of it. I'm thinking of building an array of regexes and, for each line of the file, applying every regex in the array (with this construction, it's easy to add a new rule). But maybe there is a more perlish way to do it? What do you think?

Replies are listed 'Best First'.
Re: Regex and question of design
by grinder (Bishop) on Apr 14, 2005 at 10:43 UTC
    I want to apply a set of regex on each line of this file

    At the risk of tiring readers with yet another plug for my module, I'll point out that it offers a "tracked pattern" mode, whereby you can assemble your array of regexps into a single pattern, which gives you the efficiency of performing a single match, and the convenience of being able to determine which, of the original expressions, was the one that matched. The code would go something like:

    use strict;
    use Regexp::Assemble;

    my $re = Regexp::Assemble->new( track => 1 );

    open IN, shift || 'patterns' or die "open pattern file: $!\n";
    while( <IN> ) {
        chomp;
        $re->add($_);
    }
    close IN;

    # read from, e.g., STDIN
    while( <> ) {
        chomp;
        if( defined( my $match = $re->match($_)) ) {
            print " $_ matched by $match\n";
        }
    }

    What is not obvious, is that behind the scenes, the $re->match() is performing a single match. It is not looping over the entire list of patterns. It has to be this way (rather than using the more intuitive if( /$re/ ) {...}) because of current broken behaviour in the regular expression engine (see bug #32840 for details).

    Printing out which particular pattern caused the match is not particularly helpful. What you really want to do is use the result as a hash key, either to look up a "human-readable" result string or a callback function, whatever suits your needs. If your patterns have captures (e.g. /x=(\d+)/), they are available for use.
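
    As a rough sketch of the hash-key idea, the text of the pattern that matched can key a table of callbacks. To stay runnable without the module, the hypothetical helper which_pattern() below simply loops over the patterns; with a tracked pattern the key would come from the module instead. The patterns, handlers and helper name are assumptions made up for illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical dispatch table: pattern text => callback.
    # Each callback receives the line and the first capture, if any.
    my %action = (
        'x=(\d+)' => sub { my ($line, $n) = @_; return "x is $n" },
        '^#'      => sub { return "comment" },
    );

    # Stand-in for a tracked match: return the first pattern (in sorted
    # order) that matches the line, together with its first capture.
    sub which_pattern {
        my ($line) = @_;
        for my $pattern (sort keys %action) {
            return ($pattern, $1) if $line =~ /$pattern/;
        }
        return;
    }

    my @report;
    for my $line ('# a note', 'set x=42', 'nothing') {
        my ($pattern, $capture) = which_pattern($line);
        push @report,
            defined $pattern ? $action{$pattern}->($line, $capture) : "no rule";
    }
    print "$_\n" for @report;
    ```

    Adding a rule then means adding one hash entry; the matching loop itself does not change.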

    - another intruder with the mooring in the heart of the Perl

      which gives you the efficiency of performing a single match,
      A single match is not always more efficient. A bunch of simple matches can be more efficient than a single, more complicated, match. And that's because of the Perl optimizer. Here's an example:
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Regexp::Assemble;
      use Perl6::Slurp;
      use Benchmark qw /cmpthese/;
      use Test::More tests => 1;

      our @data = slurp '/usr/share/dict/words';
      our(@a, @b);

      cmpthese(-10, {
          regex => '@a = grep {/qu/ || /x/} @data',
          ra    => 'my $re = Regexp::Assemble->new->add("qu")->add("x")->re; @b = grep {/$re/} @data',
      });

      is_deeply(\@a, \@b);

      __END__
      1..1
              Rate    ra regex
      ra    4.71/s    --  -65%
      regex 13.5/s  186%    --
      ok 1
        A bunch of simple matches can be more efficient than a single, more complicated, match.

        True, but I must take you up on two issues.

        Firstly, you are paying for the cost of constructing the assembled pattern each time through the loop. In practice one would do this only once per run. Hoisting that out of the benchmarked code would make the figures more accurate.
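
        A minimal sketch of what hoisting could look like, using only core modules. The assumptions are loud ones: qr/qu|x/ stands in for the pattern the module would build, the word list is synthetic rather than /usr/share/dict/words, and the iteration count is arbitrary, so the timings it prints say nothing about the figures quoted above.

        ```perl
        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # Synthetic stand-in data (assumption): 3000 short words.
        our @data = map { ("quick$_", "box$_", "plain$_") } 1 .. 1000;

        # Built once, outside the timed code. qr/qu|x/ is a stand-in for
        # the assembled pattern, not the module's actual output.
        our $re = qr/qu|x/;

        our (@a, @b);
        cmpthese(200, {
            regex => sub { @a = grep { /qu/ || /x/ } @data },
            ra    => sub { @b = grep { /$re/ } @data },
        });

        # Both approaches should select the same lines.
        print scalar(@a) == scalar(@b) ? "same result\n" : "results differ\n";
        ```

        Only the grep is timed here, so whatever difference remains is down to the matching itself rather than to building the pattern.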

        Secondly, I wouldn't bother with such an approach for two patterns. It only starts to come into its own for a larger number. Where the sweet spot lies, I don't know... my educated guess is more than 10, less than 20.

        But even when you have as few as ten patterns, you have to start worrying about putting /foobar/ before /foo/. Failing to do so will result in 'foobar' never being matched ('foo' will succeed instead). If you have /bin/, /bat/, /bar/, /bong/, ... it is rather wasteful to match against all four and still have it fail just because the target string happens to be 'bone'. That is what I meant when I talked of efficiency.
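
        The ordering problem is easy to reproduce with a plain loop. In this small sketch the helper first_match() is hypothetical; it returns the first pattern in the list that matches.

        ```perl
        use strict;
        use warnings;

        # Return the first pattern in the list that matches the target,
        # or nothing if none does.
        sub first_match {
            my ($target, @patterns) = @_;
            for my $p (@patterns) {
                return $p if $target =~ /$p/;
            }
            return;
        }

        # With 'foo' listed first, 'foobar' is never reported.
        my $wrong = first_match('foobar', 'foo', 'foobar');
        my $right = first_match('foobar', 'foobar', 'foo');

        print "foo listed first:    $wrong\n";
        print "foobar listed first: $right\n";
        ```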

        - another intruder with the mooring in the heart of the Perl

      grinder++, What a great module.

      When you use tracking to see which regex matched, does it still compile to one regex internally? The docs say:

      track(0|1)
      Turns tracking on or off. When this attribute is enabled, additional housekeeping information is inserted into the assembled expression using (?{...}) embedded code constructs. This provides the necessary information to determine which, of the original patterns added, was the one that caused the match.

      $re->track( 1 );
      if( $target =~ /$re/ ) {
          print "$target matched by ", $re->matched, "\n";
      }

      Note that when this functionality is enabled, no reduction is performed and no character classes are generated. In other words, brag|tag is not reduced down to (?:br|t)ag and dig|dim is not reduced to di[gm].

      So I infer it is not as optimised as a non-tracking version, but still better than the looping solution?

      If there are two regexen in the list that match, does it return the first, last, all or an indeterminate selection of the above?

      Cheers,
      R.

      Pereant, qui ante nos nostra dixerunt!
        When you use tracking to see which regex matched does it still compile to one regex internally

        Yes.

        not as optimised as a non tracking version but still better than the looping solution ?

        It has the same behaviour insofar as at any point during a tracked pattern match, the "degrees of freedom" the engine has available to try are the same as for an ordinary assembled pattern. It's just that the tracked pattern is stuffed full of (?{...}) zero-width eval assertions.
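
        To give a feel for the idea, here is a hand-written pattern with a (?{...}) block at the end of each branch. It is only an illustration and not what the module generates; the variable $which and the word list are assumptions.

        ```perl
        use strict;
        use warnings;

        # Package variable set by whichever branch completes.
        our $which;

        # One pattern, shared prefixes factored out by hand, with a
        # code block recording which original word was recognised.
        my $re = qr/
            ^(?:
                bin (?{ $which = 'bin' })
              | ba (?: t (?{ $which = 'bat' })
                     | r (?{ $which = 'bar' }) )
            )
        /x;

        my @report;
        for my $target (qw(bat bar bone)) {
            undef $which;
            push @report,
                $target =~ $re ? "$target:$which" : "$target:none";
        }
        print "$_\n" for @report;
        ```

        A target such as 'bone' fails after a couple of characters, without each word being tried as a separate match.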

        If there are two regexen in the list that match does it return the first, last, all or an indeterminate selection of the above ?

        For a given target, the same path through the pattern will always be followed. In that regard it is perfectly deterministic, it is just that it is sometimes hard to determine in advance what that will be. It sort of makes sense if you squint hard enough. Consider:

        #! /usr/local/bin/perl -w

        use strict;
        use Regexp::Assemble;

        my $re = Regexp::Assemble->new(track => 1)
            # remember to double up your backslashes
            ->add( '^X\\d+' )
            ->add( '^X\\d\\d*' )
            ->add( '^\\s*X\\d\\d*' )
            ->add( '^X\\d\\d' )
            ->add( '^X\\d' )
        ;

        while( <DATA> ) {
            chomp;
            print $re->matched, " matched <$_>\n" if $re->match($_);
        }

        __DATA__
        XY1
        X234
        X56
        X4
        Z0
         X77

        This produces:

        ^X\d\d* matched <X234>
        ^X\d\d* matched <X56>
        ^X\d\d* matched <X4>
        ^\s*X\d\d* matched < X77>

        But I would stress above all that the list of patterns in this case is in need of reformulation anyway. Hmm, in fact, this is very interesting. With a suitably exhaustive population of target test strings, you could use this approach to weed out "can't happen" patterns.
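
        A rough sketch of the weeding idea, under the assumption that a plain first-match loop is an acceptable stand-in for the tracked pattern. Note the loop breaks ties by list order, so its winners are not necessarily the ones reported above. The %wins counter is a hypothetical name.

        ```perl
        use strict;
        use warnings;

        my @patterns = ('^X\d+', '^X\d\d*', '^\s*X\d\d*', '^X\d\d', '^X\d');
        my @targets  = ('XY1', 'X234', 'X56', 'X4', 'Z0', ' X77');

        # Count how often each pattern is the one that wins.
        my %wins = map { $_ => 0 } @patterns;
        for my $target (@targets) {
            for my $p (@patterns) {
                if ($target =~ /$p/) {
                    $wins{$p}++;
                    last;    # first match wins, as in a rule list
                }
            }
        }

        # Patterns that never won are candidates for removal.
        my @never = grep { $wins{$_} == 0 } @patterns;
        print "never won: $_\n" for @never;
        ```

        The result is only as trustworthy as the set of targets is exhaustive.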

        Thank-you for asking this question! I might recycle some of the ideas in this thread into examples for the module distribution.

        - another intruder with the mooring in the heart of the Perl

      Very interesting (I'm just finishing reading your README). I think I'll write two scripts, one with my original idea and one with your module, so I can learn and understand more ;-)
Re: Regex and question of design
by Random_Walk (Prior) on Apr 14, 2005 at 09:42 UTC

    If you have a few regexes to apply to each line, an array of pre-compiled regexes is about the best you are going to get if you need to know which regex matched. If you only care about the fact that one did match, you could construct one super regex to rule them all, but that could be difficult.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @regex;
    while (<DATA>) {
        chomp;
        push @regex, qr/$_/;
    }

    while (my $line = <STDIN>) {
        for (0..$#regex) {
            print "matched no: $_\n" if $line=~/$regex[$_]/;
        }
    }

    __DATA__
    foo
    bar
    baz
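
    For the "one super regex" case, the simplest form is probably a joined alternation, sketched below. The sample lines are made up, and this naive join does none of the prefix sharing a smarter assembly would do.

    ```perl
    use strict;
    use warnings;

    my @patterns = qw(foo bar baz);

    # Wrap each pattern in (?:...) so that joining with | cannot
    # change what the individual patterns mean.
    my $any   = join '|', map { "(?:$_)" } @patterns;
    my $super = qr/$any/;

    # Hypothetical input lines.
    my @lines = ("a food truck\n", "nothing here\n", "bazaar\n");

    # We learn that a line matched, but not which pattern did it.
    my @hits = grep { /$super/ } @lines;
    print "matched: $_" for @hits;
    ```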

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
      I already have much the same script ;-)
      Thanks for reminding me of the existence of qr//!

Node Type: perlquestion [id://447672]
Approved by Corion
Front-paged by ww