PerlMonks  

Finding out which of a list of patterns matched

by lemnisca (Sexton)
on Jan 22, 2006 at 10:47 UTC ( [id://524768]=perlquestion )

lemnisca has asked for the wisdom of the Perl Monks concerning the following question:

I have a list of patterns which I am trying to match against a piece of text. What I want to find is the pattern earliest in the list which matches earliest in the text, where matching earlier in the text is more important than being higher up the list. Perl will do this for me if I do something like /foo|bar|baz/, however this leaves me with a dilemma. I need to know which of the patterns matched, because the action I take next depends on which pattern was successful.

One way I could do this would be to use a pattern similar to the above and then take the text that matched and check it against all of my patterns in order until I find the one it matched. However, this means I am trying to match the same pattern against the same text multiple times, which doesn't really seem like it'd be a good idea. I've also thought about looping through each pattern in turn but I think that would be even worse than the first way. I am wondering if there is a better way to accomplish this? If there isn't I guess I'll just do the match using the alternation but I'd hate to be redundantly re-testing patterns when I don't need to. I'll potentially have quite a lot of patterns to test and quite a lot of text, and I need to find as many matches as possible, so it's worthwhile doing things efficiently. :)

Replies are listed 'Best First'.
Re: Finding out which of a list of patterns matched
by nobull (Friar) on Jan 22, 2006 at 11:06 UTC
    $#- gives the number of the last capture group that matched in the last successful regex. With simple alternation, as in this case, you can just make each alternative a capture.
    if ( /(foo)|(bar)|(baz)/ ) { print "$#-\n"; }
    For details, see the description of @- in perlvar.
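    For a list of patterns built at run time, the same idea might look like the sketch below. The helper name which_matched and the sample patterns are made up for illustration, and the sketch assumes that none of the patterns contains capturing parentheses of its own, since those would shift the group numbers.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 0-based index of the pattern that
# matched (the leftmost match in the text wins), or nothing if no
# pattern matched. Assumes the patterns have no capture groups of
# their own.
sub which_matched {
    my ($text, @patterns) = @_;

    # Wrap every pattern in its own capture group: (p1)|(p2)|(p3)
    my $alternation = join '|', map { "($_)" } @patterns;

    return unless $text =~ /$alternation/;

    # $#- is the number of the last group that took part in the match.
    # Groups are numbered from 1, the pattern list from 0.
    return $#- - 1;
}

my @patterns = ('fo+', 'ba[rz]', '\d+');
my $index = which_matched('there are 42 bars', @patterns);
print "pattern $index ($patterns[$index]) matched\n" if defined $index;
```

    Here the digits come before "bar" in the text, so the third pattern is reported even though it is last in the list.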

      In your example, $#- will always be 1, because the conditional is fulfilled after the first match. Also, $#- can't really be useful in this matter at all, because it returns the index of the last match in @-, which in turn gives you the match offset into the string, not the matched content.

      Update: This is completely wrong and nobull is correct, my apologies (hangs head in shame). To make up for it, here's a practical example of finding your matched word with nobull's method (I still like my method below better, but YMMV):

      my @words = qw(foo bar baz);
      # map (not grep) so that each word is actually escaped
      my $regex = join(")|(", map { quotemeta($_) } @words);
      if ($input =~ m/($regex)/) {
          my $index = $#- - 1;
          print "matched word $words[$index]\n";
      }

      There are ten types of people: those that understand binary and those that don't.
        I still like my method below better <snip>

        I don't. The OP specifically states "I'll potentially have quite a lot of patterns to test". What if "quite a lot" means, say, 50, or 500, or 5,000? Your dispatch table could become unmanageably long...

Re: Finding out which of a list of patterns matched
by grinder (Bishop) on Jan 22, 2006 at 11:14 UTC
    One way I could do this would be to use a pattern similar to the above and then take the text that matched and check it against all of my patterns in order until I find the one it matched

    That sounds like exactly the same problem I had once, which led me to write the "tracked" pattern mode of Regexp::Assemble. Take a look at that, I think it will do just what you want.

    In essence: you take all your patterns, assemble them into one pattern with tracking enabled, and then look for matches with that. When you get a match (using that one pattern), you can then retrieve the original pattern that would have matched in its place.

    There are a couple of scripts in the eg/ directory that should help get you started.

    • another intruder with the mooring in the heart of the Perl

Re: Finding out which of a list of patterns matched
by Not_a_Number (Prior) on Jan 22, 2006 at 12:14 UTC
    while ( <DATA> ) {
        if ( /(foo)|(bar)|(baz)/ ) {
            print "Found $+ line $.\n";
            last;
        }
    }
    __DATA__
    Stuff
    The quick brown baz jumps over the lazy bar
    foo

    Update: Here's a complete programme. Enter the name(s) of the text file(s) on the command line.

    use strict;
    use warnings;

    die "Usage: $0 filename(s)\n" unless @ARGV;

    my @to_match = qw ( foo bar baz 43 q\uux ); # or whatever

    my $match = join ')|(', map quotemeta, @to_match;

    while ( <> ) {
        if ( /($match)/ ) {
            print "Found '$+' in $ARGV line $.\n";
            close ARGV;
        }
        $. = 0 if eof;
    }
Re: Finding out which of a list of patterns matched
by tirwhan (Abbot) on Jan 22, 2006 at 11:10 UTC

    You could capture the match (read perldoc perlre on capturing brackets) and use the capture result to determine your program flow. One elegant way of doing this is to use a hash with the possible matches as hash keys, and the corresponding actions as the hash values in a subroutine reference (this is commonly called a "dispatch table"):

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Carp;

    my $input = shift or croak "Give me an argument, fool!\n";

    my %action = (
        foo => sub { print "fooing\n" },
        bar => sub { print "barred\n" },
        baz => sub { print "all bazzed out\n" },
    );

    if (my ($match) = $input =~ m/(foo|bar|baz)/) {
        &{$action{$match}};
    }
    else {
        croak "Dunno what to do\n";
    }

    There are ten types of people: those that understand binary and those that don't.
      Unfortunately I'm not sure that this will be able to work with my patterns. I'll be reading the patterns in from another file, not creating them myself, and they could be arbitrarily complex - not necessarily a simple word to match like 'foo' - which means I don't think the matching text will be much good as a key to a hash. Sorry if I misled you on that point in my original post - I should have given a better example of the patterns.

        The unasked question here is: Once you have the match, how will you determine what action to take?

        If you are reading the patterns to be matched from a separate file prepared by someone else, how do they indicate what action should be taken? Or how do you decide what action should be taken for each pattern matched?

        could be arbitrarily complex - ... I don't think the matching text will be much good for a key to a hash.

        The matching text may not be a good key as it wouldn't match a hash key that was a non-constant regex, but using $#- will tell you which pattern was matched, and could be used as an index to an array containing the original patterns to retrieve the one that matched (rather than the text it matched).

        Eg.

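        my @patterns = slurp( 'file' );
        my %dispatch;
        @dispatch{ @patterns } = ( ??? );

        # Wrap each pattern in its own capture group: (p1)|(p2)|...
        my $regex = '(' . join( ')|(', @patterns ) . ')';

        while( my $data = <DATA> ) {
            if( $data =~ $regex ) {
                # Groups are numbered from 1, @patterns from 0;
                # $+ is the text captured by the group that matched.
                $dispatch{ $patterns[ $#- - 1 ] }->( $+ );
            }
        }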

        The problem with the above code is what to substitute for '???'. Ie. What action is required for matches against each pattern. But that problem exists whether you are using a dispatch table; an if/else cascade or any other mechanism. Of the choices available the dispatch table is by far the easiest--if not the only--option for the dynamic search patterns you describe.
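        One way the dispatch-by-index idea could be fleshed out is sketched below. The rule list, the actions and the name dispatch are all invented for illustration (in practice both patterns and actions would come from wherever the patterns are defined), and the sketch assumes the patterns contain no capture groups of their own.

```perl
use strict;
use warnings;

# Hypothetical rule list: each entry pairs a pattern with its action.
my @rules = (
    [ 'fo+',    sub { "fooing on '$_[0]'" } ],
    [ 'ba[rz]', sub { "barred by '$_[0]'" } ],
    [ '\d+',    sub { "counted $_[0]" } ],
);

# One capture group per rule, so group N belongs to $rules[N - 1].
my $alternation = join '|', map { "($_->[0])" } @rules;
my $regex       = qr/$alternation/;

sub dispatch {
    my ($text) = @_;
    return unless $text =~ $regex;

    my $rule    = $rules[ $#- - 1 ];   # which alternative matched
    my $matched = $+;                  # copy the matched text before calling out
    return $rule->[1]->($matched);
}

print dispatch('see the bazaar'), "\n";
```

        Keeping pattern and action side by side in one array avoids using the pattern text as a hash key, and the action is found by position rather than by re-testing each pattern.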


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Finding out which of a list of patterns matched
by lemnisca (Sexton) on Jan 23, 2006 at 00:53 UTC
    Thanks to everyone that replied. Your help has been invaluable and I've learnt quite a bit from reading these answers. :)

    I think I'll go with nobull's method as it seems to be the simplest to do what I want.
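    Since the original question mentions finding as many matches as possible, nobull's approach could plausibly be extended with the /g modifier, as in this sketch. The helper name all_matches and the sample data are hypothetical, and the patterns are again assumed to contain no capture groups of their own.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one [pattern index, offset, matched text]
# triple for every non-overlapping match in $text, in text order.
sub all_matches {
    my ($text, @patterns) = @_;

    my $alternation = join '|', map { "($_)" } @patterns;
    my $regex       = qr/$alternation/;

    my @found;
    while ($text =~ /$regex/g) {
        # $#- is the group that matched, $-[0] is where the match
        # starts, and $+ is the text that group captured.
        push @found, [ $#- - 1, $-[0], $+ ];
    }
    return @found;
}

for my $hit (all_matches('foo 12 baz fooo', 'fo+', 'ba[rz]', '\d+')) {
    printf "pattern %d at offset %d: %s\n", @$hit;
}
```

    Each pass through the loop resumes where the previous match ended, so the text is scanned only once.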
Re: Finding out which of a list of patterns matched
by duff (Parson) on Jan 22, 2006 at 20:33 UTC
    What I want to find is the pattern earliest in the list which matches earliest in the text, where matching earlier in the text is more important than being higher up the list.

    Does that mean that if multiple patterns match the text you want the one that matches at the point closest to the start of the string? Because, if so, none of the solutions given so far will do that, as far as I can see. Even your suggested use of /foo|bar|baz/ won't do this as perl tries the patterns from left to right and gives you the one that matches first whether or not it's "earliest" in the string. I think that if you want this behavior, you'll have to loop over each pattern recording where they matched (if at all) and then select one with the lowest match position.

      Are you sure of that? I'm only fairly new to Perl so I could be interpreting this wrongly, but in the Camel book (which I've been reading) it says:

      "...But the ordering of the alternatives only matters at a given position. The outer loop of the Engine does left-to-right matching, so the following always matches the first Sam:"
      "'Sam I am,' said Samwise" =~ /(Samwise|Sam)/; # $1 eq "Sam"
      which suggests to me that the match earliest in the text will be found. If that isn't the case it's going to make my life a whole lot more difficult...:P

        Ho, ho, you are right! I don't know what I was thinking.
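        The behaviour agreed on here is easy to check. A small sketch (the helper first_match and the sample strings other than the Camel book one are made up for illustration):

```perl
use strict;
use warnings;

# Returns the text of the first match of $regex in $text, or undef.
sub first_match {
    my ($text, $regex) = @_;
    my ($hit) = $text =~ /($regex)/;
    return $hit;
}

# 'baz' is last in the alternation but first in the text, so it wins.
print first_match('a baz, then a bar, then a foo', qr/foo|bar|baz/), "\n";

# The Camel book example: the first "Sam" in the text is found.
print first_match("'Sam I am,' said Samwise", qr/Samwise|Sam/), "\n";

# Order in the list only matters when two alternatives can match at
# the same position; here 'Sam' is tried before 'Samwise'.
print first_match('Samwise said so', qr/Sam|Samwise/), "\n";
```

        So position in the text takes priority, and position in the list only breaks ties at a given starting point, which is what the original question asked for.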

Re: Finding out which of a list of patterns matched
by adamk (Chaplain) on Jan 22, 2006 at 22:09 UTC
    Have a look at the internals of HTML::TrackerLink. It does something like what you want possibly.

Node Type: perlquestion [id://524768]
Approved by Corion
Front-paged by traveler