in reply to Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions

After $bates_number =~ m{...} you can do

my ($pfx, $num) = map substr($bates_number, $-[$_], $+[$_] - $-[$_]), grep defined $-[$_], 1 .. $#+ ;
to get the subgroups that matched. (See perlvar for @- and @+.) This way you avoid the tricky bits about having variables in your code assertions, but more importantly: you can factor out the patterns you join together with alternation (as long as you make sure they always match with two capturing subpatterns).
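As a minimal sketch of the idea above: the two sub-patterns here are hypothetical stand-ins (the real Bates number pattern is not shown in this thread), and the helper name captured_groups is made up for illustration. Each alternative contributes exactly two capturing groups, and the offsets in @- and @+ pick out whichever pair matched.

```perl
use strict;
use warnings;

# Hypothetical sub-patterns, NOT the ones from the original question.
# Each alternative has exactly two capturing groups: prefix and number.
my @alternatives = (
    qr/([A-Z]+-)0*(\d+)/,      # e.g. "ABCD-00654321"
    qr/([A-Z]+)\s+0*(\d+)/,    # e.g. "XYZ 00123456"
);

# Join the factored-out alternatives into one anchored pattern.
my $pattern = do {
    my $joined = join '|', @alternatives;
    qr/\A(?:$joined)\z/;
};

# Hypothetical helper: returns the captured substrings of the
# alternative that matched, or an empty list on failure.
sub captured_groups {
    my ($string) = @_;
    $string =~ $pattern or return;

    # Groups belonging to alternatives that did not match have
    # undefined offsets in @-, so they are filtered out first.
    return map  { substr $string, $-[$_], $+[$_] - $-[$_] }
           grep { defined $-[$_] } 1 .. $#+;
}

my ($pfx, $num) = captured_groups('ABCD-00654321');
print "$pfx|$num\n";
```

Because the extraction only looks at offsets, adding a third alternative to @alternatives should not require touching captured_groups, provided the new alternative also has two capturing groups.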

lodin

Re^2: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions
by Jim (Curate) on Sep 09, 2007 at 22:45 UTC
    Excellent! This is just what I was looking for: a way to obviate those nasty, repeated code assertions. I'd read a little about @+ and @- in perlvar and perlre, but I needed a concrete example that was personal to me to help make sense of them finally.

    I soon had a function based on your suggestion.

    use English qw( -no_match_vars );
    ...
    my $bates_number_pattern = qr{ ... }x;
    ...
    sub parse_bates_number {
        my $bates_number = shift;

        $bates_number =~ $bates_number_pattern
            or die "Invalid Bates number: $bates_number\n";

        return map  { substr $bates_number, $LAST_MATCH_START[$_], $LAST_MATCH_END[$_] - $LAST_MATCH_START[$_] }
               grep { defined $LAST_MATCH_START[$_] }
               ( 1 .. $#LAST_MATCH_START );
    }
    ...
    my ($prefix, $number) = parse_bates_number($bates_number);
    I chose to use English to muffle the line noise a bit. I also realized I didn't need to iterate over the whole series of subgroups in the regular expression; I only needed to iterate up to the last matched subgroup, so I used (1..$#LAST_MATCH_START) instead of (1..$#LAST_MATCH_END).
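To make the difference between the two upper bounds concrete, here is a small sketch. The pattern and the helper name last_indices are hypothetical, chosen only for illustration. Per perlvar, $#- is the index of the last matched subgroup, while $#+ is the number of subgroups in the last successful match.

```perl
use strict;
use warnings;

# Hypothetical pattern: two alternatives, two capturing groups each.
my $re = qr/\A(?:([A-Z]+)-(\d+)|([a-z]+)_(\d+))\z/;

# Hypothetical helper: returns ($#-, $#+) after a successful match,
# or an empty list if the string does not match.
sub last_indices {
    my ($string) = @_;
    $string =~ $re or return;
    return ($#-, $#+);
}

printf "%-8s last matched group: %d, groups in pattern: %d\n",
    $_, last_indices($_)
    for 'ABC-123', 'abc_123';
```

When the first alternative matches, the loop bound $#- stops at group 2 and never visits the unused groups 3 and 4. When the second alternative matches, both bounds are 4, so the grep on defined offsets is still needed to skip groups 1 and 2.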

    I tested it and it worked brilliantly. But I was bothered by the fact that I was parsing the Bates numbers twice: once with a regular expression pattern and then again with substr. The two matched substrings were already captured and stored in variables--some $m and $n from the regular expression match--and yet I was extracting them anew with a string function.

    So I tried this and it, too, worked flawlessly:

    no strict 'refs';
    return map  { $$_ }
           grep { defined $LAST_MATCH_START[$_] }
           ( 1 .. $#LAST_MATCH_START );
    Because $$_ is a symbolic reference, I'm forced to countermand strict 'refs', but this is a rare, legitimate use of symbolic references, don't you think?

    Here's the revised script in its entirety:

    And here's its output:

    XYZ 123 00000123 XYZ 123 123
    XYZ 123 00000456 XYZ 123 456
    XYZ 123 00654321 XYZ 123 654321
    XYZ 12 ST 00123456 XYZ 12 ST 123456
    XYZ 123 ST 00654321 XYZ 123 ST 654321
    XYZ U 123 00123456 XYZ U 123 123456
    XYZ U 12 00654321 XYZ U 12 654321
    XYZ V 1 00123456 XYZ V 1 123456
    XYZ 12300654321 XYZ 123 654321
    XYZ 00123456 XYZ 123456
    XYZ 0654321 XYZ 654321
    ABC-M-0123456 ABC-M- 123456
    ABCD-00654321 ABCD- 654321
    00000123456 123456
    99999999999 99999999999
    Invalid Bates number: BOGUS99
    I'm not exactly sure why I used a BEGIN block. It seems right. Is it?

    Thanks again!

    Jim

      Because $$_ is a symbolic reference, I'm forced to countermand strict 'refs', but this is a rare, legitimate use of symbolic references, don't you think?

      I agree. Most of the time I just use Symbol's qualify_to_ref, an oft-forgotten but very handy routine, but I think using $$_ is clearer in this simple case. I'd try to limit the scope of no strict 'refs' as much as possible, though, and usually that involves a do { ... } construct.

      return grep defined, map do { no strict 'refs'; $$_ }, 1 .. $#- ;
      Here I've used the fact (as in my original reply) that map and grep can take an expression instead of a block, so note the comma after the do block. I also check the match variables themselves for definedness instead of the entries of @-. It just felt nice. (In the first reply I had to check the entries of @- first.)
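For anyone who wants to try the one-liner in isolation, here is a self-contained sketch. The pattern and the sub name captures are hypothetical; only the return statement comes from the reply above.

```perl
use strict;
use warnings;

# Hypothetical pattern: two alternatives, two capturing groups each.
my $re = qr/\A(?:([A-Z]+)-(\d+)|([a-z]+)_(\d+))\z/;

# Hypothetical wrapper around the one-liner from the reply.
sub captures {
    my ($string) = @_;
    $string =~ $re or return;

    # strict 'refs' is relaxed only inside the do block, where $$_
    # is a symbolic reference to $1, $2, ... for $_ = 1, 2, ...
    return grep defined, map do { no strict 'refs'; $$_ }, 1 .. $#-;
}

print join('|', captures($_)), "\n" for 'ABC-123', 'abc_123';
```

Everything outside the do block, including the rest of the sub, still runs under strict 'refs', so an accidental symbolic reference elsewhere would still be caught.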

      I'm not exactly sure why I used a BEGIN block. It seems right. Is it?

      In this case it's a matter of taste. You don't need it, but it may bring some benefits in the future. Personally, I stay away from BEGIN blocks until I need them (and I rarely do). That way, when I do see one in my own code, I know the code needs special care. In either case, I'd keep the curly brackets to limit the scope of $bates_number_pattern.
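One possible reading of "benefits in the future" is the ordering issue described in perlsub for private variables shared with a named sub. The sketch below uses a hypothetical pattern and hypothetical sub names, since the original script is not shown here. It checks whether the lexical has been initialized when the sub is called from code that runs before the enclosing block.

```perl
use strict;
use warnings;

# Run-time code at the top of the file, executed before either block below.
my @early = (begin_ready(), plain_ready());

BEGIN {
    # Assigned at compile time, so it is set before any run-time call.
    my $pattern = qr/\A([A-Z]+)(\d+)\z/;    # hypothetical pattern
    sub begin_ready { return defined $pattern ? 1 : 0 }
}

{
    # Assigned only when execution reaches this block at run time.
    my $pattern = qr/\A([A-Z]+)(\d+)\z/;    # same hypothetical pattern
    sub plain_ready { return defined $pattern ? 1 : 0 }
}

my @late = (begin_ready(), plain_ready());

print "early: @early\n";
print "late:  @late\n";
```

In both variants the curly brackets keep $pattern invisible to the rest of the file. The BEGIN form only matters if the sub can be called before the block is reached, which is not the case when the block sits above all calling code.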

      lodin