Hello my good Monks. Based on a question in irc.freenode.net/#perl, I created a simple pattern matching routine. It takes patterns with named match tokens between curly brackets, such as {foo}/{bar}-{baz}, and then returns a hash containing the token names as hash keys and the matched strings as values.

It is fairly trivial, and the code to implement it is quite simple. But before I blow this away, I was curious if there were any CPAN modules to accomplish the same task. I would be surprised if there weren't, but if not, perhaps I should make one? It seems vaguely useful, and might deserve somewhere to live. Any thoughts welcomed.

Here is the code as it currently stands:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

print Dumper({ named_match("{foo} {bar}", "one two") });
print Dumper({ named_match("{baz} - {qux}", "three - four - five") });

sub named_match {
    my ($pattern, $string) = @_;
    my @names;
    # Replace each {name} token with a capture group, remembering the names in order.
    for my $token ($pattern =~ /\{([^}]+)\}/g) {
        my $quoted = quotemeta $token;
        $pattern =~ s/\{$quoted\}/(.*)/;
        push @names, $token;    # store the unescaped name as the hash key
    }
    my %matches;
    # Hash slice: pair the token names with the captured strings.
    @matches{@names} = $string =~ /$pattern/;
    return %matches;
}

Update: fixed an error caused by last-minute changes. Thanks to ccn.
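
As a point of comparison, here is a minimal sketch of the same idea built on native named captures. It assumes Perl 5.10 or later, where `(?<name>...)` groups and the `%+` hash are available; the routine above does not need them. The name `named_match_native` is made up for illustration, token names are restricted to identifiers, and unlike the original it escapes the literal text between tokens.

```perl
use strict;
use warnings;

# Hypothetical variant: turn each {name} token into a native named
# capture group and escape everything else in the pattern.
sub named_match_native {
    my ($pattern, $string) = @_;
    my $re = '';
    # split with a capturing group keeps the {name} tokens in the list
    for my $piece (split /(\{[A-Za-z_]\w*\})/, $pattern) {
        if ($piece =~ /\A\{(\w+)\}\z/) {
            $re .= "(?<$1>.*)";          # named capture for this token
        }
        else {
            $re .= quotemeta $piece;     # literal text, escaped
        }
    }
    return unless $string =~ /$re/;
    my %captures = %+;                   # copy out of the magic hash
    return %captures;
}

my %m = named_match_native("{baz} - {qux}", "three - four - five");
print join(", ", map { "$_=$m{$_}" } sort keys %m), "\n";
```

Because `.*` is greedy, `baz` receives "three - four" and `qux` receives "five", the same split the original routine produces.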

Replies are listed 'Best First'.
Re: RFC: named pattern match tokens
by diotalevi (Canon) on Oct 04, 2004 at 20:04 UTC

    You might also like to look at extending the regexp syntax. This adds the new ( ... capture ... )\C{name} element to regular expression syntax. It copies the contents of the last closed capture into the scalar variable named 'name'. So /( [\dA-F]+ ) \C{ hex }/x would copy a hex string to the $hex variable.

    use Regexp::NamedCaptures;

    $_ = "three - four - five";
    /(\w+)\C{baz} - (\w+)\C{qux}/g;
    print "baz=$baz, qux=$qux\n";

    Regexp::NamedCaptures
    Updated: Changed the \N{ ... } to \C{ ... } to not conflict with named characters.
    Also changed the return value of convert() so it returns the altered expression instead of the boolean result of the s///.

    package Regexp::NamedCaptures;
    use overload;

    sub import {
        shift;
        die "No argument allowed to " . __PACKAGE__ . "::import" if @_;
        overload::constant qr => \ &convert;
    }

    sub convert {
        my $re = shift;
        $re =~ s( \\
                  ( \\
                  | C\{ (?>\s*) ((?>\w+)) (?>\s*) \}
                  )
                ) {
            defined $2 ? "(?{\$$2=\$^N})" : "\\"
        }xeg;
        $re;
    }

    1;
      You might also like to look at extending the regexp syntax

      I thought about this, but didn't have any experience doing so. I just went with what I know, but I may take a look at your code, and see how it works.

      It copies the contents of the last closed capture into the scalar variable named 'name'

      I'm not sure I like this part. The idea of extending regular expression syntax is nice, but storing the matches in arbitrary scalars seems a bit sloppy. Maybe this can be reworked to store the results in a hash.

      Something along the lines of:

      use re 'eval';
      use strict;

      my $re = convert('(foo)\C{ foo }');
      my %hash;
      "foo bar" =~ $re;
      print $hash{foo}, "\n";

      sub convert {
          my $re = shift;
          $re =~ s( \\
                    ( \\
                    | C\{ (?>\s*) ((?>\w+)) (?>\s*) \}
                    )
                  ) {
              defined $2 ? "(?{\$hash{$2}=\$^N})" : "\\"
          }xeg;
          $re;
      }

      This is only marginally better, though, because instead of clobbering any arbitrary number of scalar variables, it clobbers one hash. Maybe there's a cleaner way to handle this.

        I thought the same. I think a nice name for the hash would be %~: =~ is matching, so why couldn't $~{name} be a named match? Here is the code I ended up with:

        ...
        sub convert {
            my $re = shift;
            $re =~ s( \\
                      ( \\
                      | C\{ (?>\s*) ((?>\w+)) (?>\s*) \}
                      )
                    ) {
                defined $2 ? "(?{\$~{$2}=\$^N})" : "\\"
            }xeg;
            "(?{undef(%~)})"                         # clear the %~
            .$re
            ."(?{\$~{\$_}=\${\$_} for(1..\$#+)})";   # add the numbered matches
        }
        ...
        my $re = qr/(\w+)\C{baz}(?: - (\w+)\C{qux})?(\+\d+)/;
        "three - four - five+89" =~ $re;
        print "baz=$~{baz}, qux=$~{qux}, $~{3}\n";
        Please note that even the named matches are also stored under their numbers! Maybe they should not be; I think I could implement that if I needed to.

        I also considered syntax like this:

        my $re = qr/(?\$bar=\w+) - (?\$qux{not}=\w+)/;
        which could naively be implemented like this:
        ...
        sub convert {
            my $re = shift;
            $re =~ s<\(\?\\\$([^=]+)=([^)]*)\)><($2)(?{\$$1=\$^N})>g;
            $re
        }
        ...
        but the problem is that I don't know how to make this work for nested groups, such as:
        my $re = qr/...(?\$var=a(\d+|\w-\w+)b).../;
        I don't know how to find the matching closing parenthesis.
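
One way to locate the matching parenthesis is to walk the pattern source and count nesting depth. The following is only a sketch under stated assumptions: `find_closing_paren` is a hypothetical helper, it skips backslash-escaped characters, and it does not handle parentheses inside bracketed character classes such as `[)]`.

```perl
use strict;
use warnings;

# Hypothetical helper: given a regex source string and the offset of an
# opening "(", return the offset of its matching ")", or -1 if the
# parentheses are unbalanced.
sub find_closing_paren {
    my ($re, $open) = @_;
    my $depth = 0;
    for (my $i = $open; $i < length $re; $i++) {
        my $c = substr $re, $i, 1;
        if    ($c eq '\\') { $i++ }                        # skip the escaped character
        elsif ($c eq '(')  { $depth++ }
        elsif ($c eq ')')  { return $i if --$depth == 0 }
    }
    return -1;
}

my $src   = '...(?\$var=a(\d+|\w-\w+)b)...';
my $open  = index $src, '(';
my $close = find_closing_paren($src, $open);
print substr($src, $open, $close - $open + 1), "\n";
```

With the group's extent known, the text between `=` and the closing parenthesis could be wrapped in a plain capture group before appending the `(?{...})` assignment.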

        Jenda
        We'd like to help you learn to help yourself
        Look around you, all you see are sympathetic eyes
        Stroll around the grounds until you feel at home
           -- P. Simon in Mrs. Robinson

        Well that's fine. It could clobber %Regexp::NamedCapture::Captures because regexp results are already globals.
Re: RFC: named pattern match tokens
by hardburn (Abbot) on Oct 04, 2004 at 20:09 UTC

    IIRC, other languages already do this (Python, for one). Perl6 also has syntax for this. Perl5 has it in a limited way:

    my ($first_char, $word) = $_ =~ /\A (.) (\w+) /x;

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

      Perl5 has it in a limited way

      The Perl5 code you've shown is not really the same, because the names are defined outside of the pattern. The idea is, the pattern contains not only the thing to be matched, but the name to give the match.
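
To make the contrast concrete, here is a small sketch in plain Perl 5 that does produce a hash, using a hash slice. The names in `@names` and the sample string are chosen purely for illustration. Note that the names still live in the calling code rather than in the pattern, which is exactly the difference described above.

```perl
use strict;
use warnings;

my @names = qw(first_char word);   # names kept outside the pattern
my %match;

# The hash slice pairs each name with the capture in the same position.
@match{@names} = 'hello' =~ /\A (.) (\w+) /x;

print "$_ = $match{$_}\n" for @names;
```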

        While it is true that they are not the same, it does have the benefit of not creating a hash to muck with, since hashes (as nice as they are) do not interpolate. Plus, it does not require a non-standard module to be installed :) Actually, this is a really cool trick and will replace my dealings with $1 and $2 in future code. Thanks, hardburn.

        OT: regarding hashes and why I don't like to use them in some places, does anybody know whether Perl6 can or will support the Ruby-esque print "hash value for key #[$key] is #[$hash{$key}]", i.e. interpolating arbitrary code (even function calls) into strings? That would be cool, and it would definitely cut down on the verbosity of string concatenation.