Jim has asked for the wisdom of the Perl Monks concerning the following question:

What's the best way to match against multiple, carefully-arranged alternative patterns and capture multiple matched subexpressions in a finite number of variables (in this case, two: $pfx and $num)?

(The truth is, I can't figure out how to ask the question I mean to ask. Please infer and interpret liberally and generously.)

I did this. I don't like it because it depends on esoteric regular expression stuff and because there are a bunch of repeated assignments to the same two variables. Is there a better way to accomplish the same parsing task?

$ cat parse_bates_numbers.pl #!/usr/bin/perl # # parse_bates_numbers.pl use strict; use warnings; $\ = "\n"; $, = "\t"; my ($pfx, $num); while (my $bates_number = <DATA>) { chomp $bates_number; undef $pfx; undef $num; $bates_number =~ m{ \A (?: # XYZ 999 99999999 or XYZ 99 ST 99999999 or XYZ 999 ST 9999999 +9 (XYZ\s\d{2,3}(?:\sST)?) (?{ $pfx = $^N }) \s(\d{8}) (?{ $num + = $^N }) | # XYZ U 999 99999999 or XYZ U 99 99999999 or XYZ V 9 99999999 (XYZ\s[UV]\s\d{1,3}) (?{ $pfx = $^N }) \s(\d{8}) (?{ $num + = $^N }) | # XYZ 99999999999 (XYZ\s\d{3}) (?{ $pfx = $^N }) (\d{8}) (?{ $num + = $^N }) | # XYZ 99999999 or XYZ 9999999 (XYZ) (?{ $pfx = $^N }) \s(\d{7,8}) (?{ $num + = $^N }) | # ABC-M-9999999 (ABC-M-) (?{ $pfx = $^N }) (\d{7}) (?{ $num + = $^N }) | # ABCD-99999999 (ABCD-) (?{ $pfx = $^N }) (\d{8}) (?{ $num + = $^N }) | # 99999999999 () (?{ $pfx = $^N }) (\d{11}) (?{ $num + = $^N }) ) \z }x or die "Invalid Bates number $bates_number"; print $bates_number, $pfx, $num + 0; } exit 0; __END__ XYZ 123 00654321 XYZ 12 ST 00123456 XYZ 123 ST 00654321 XYZ U 123 00123456 XYZ U 12 00654321 XYZ V 1 00123456 XYZ 12300654321 XYZ 00123456 XYZ 0654321 ABC-M-0123456 ABCD-00654321 00000123456 $ perl ./parse_bates_numbers.pl | expand -20 XYZ 123 00654321 XYZ 123 654321 XYZ 12 ST 00123456 XYZ 12 ST 123456 XYZ 123 ST 00654321 XYZ 123 ST 654321 XYZ U 123 00123456 XYZ U 123 123456 XYZ U 12 00654321 XYZ U 12 654321 XYZ V 1 00123456 XYZ V 1 123456 XYZ 12300654321 XYZ 123 654321 XYZ 00123456 XYZ 123456 XYZ 0654321 XYZ 654321 ABC-M-0123456 ABC-M- 123456 ABCD-00654321 ABCD- 654321 00000123456 123456 $
Also, why do I have to declare the variables $pfx and $num outside the while loop for this to work properly?

Jim

  • Comment on Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions
  • Download Code

Replies are listed 'Best First'.
Re: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions
by bobf (Monsignor) on Sep 08, 2007 at 21:02 UTC

    I started thinking about using a few nested if statements to simplify the regexes used at each stage, but that looked more complicated than necessary. I settled on the following, which eliminates the code blocks (assignments to variables) in the middle of the regex (which, according to perlre, is "considered highly experimental, and may be changed or deleted without notice") and uses multiple regexes rather than a single one with alternation. IMO, eliminating the code blocks and laying out the patterns so the common parts are aligned makes them easier to read.

    Note that I don't use the temporary lexical variables $pfx and $num because the regex is inside an if block and the captured strings are only used in a single print statement. If you're doing more than printing, you could add them back in as shown in the comment.

    I also removed the die so the program would continue processing, but you should use whatever is most appropriate in your situation.

    use strict; use warnings; while( my $bates_number = <DATA> ) { chomp $bates_number; if( $bates_number =~ m/^( XYZ \s \d{2,3} (?:\sST)? ) \s ( \d{8} +)$/x || $bates_number =~ m/^( XYZ \s [UV] \s \d{1,3} ) \s ( \d{8} +)$/x || $bates_number =~ m/^( XYZ \s \d{3} ) ( \d{8} +)$/x || $bates_number =~ m/^( XYZ ) \s ( \d{7,8} +)$/x || $bates_number =~ m/^( ABC-M- ) ( \d{7} +)$/x || $bates_number =~ m/^( ABCD- ) ( \d{8} +)$/x || $bates_number =~ m/^( ) ( \d{11} +)$/x ) { # my ( $pfx, $num ) = ( $1, $2 ); # could assign here printf( "%-20s %-12s % d\n", $bates_number, $1, $2 ); } else { print "Invalid Bates number: $bates_number\n"; } } __DATA__ XYZ 123 00654321 XYZ 12 ST 00123456 XYZ 123 ST 00654321 XYZ U 123 00123456 XYZ U 12 00654321 XYZ V 1 00123456 XYZ 12300654321 XYZ 00123456 XYZ 0654321 ABC-M-0123456 ABCD-00654321 00000123456

    Output:

    XYZ 123 00654321 XYZ 123 654321 XYZ 12 ST 00123456 XYZ 12 ST 123456 XYZ 123 ST 00654321 XYZ 123 ST 654321 XYZ U 123 00123456 XYZ U 123 123456 XYZ U 12 00654321 XYZ U 12 654321 XYZ V 1 00123456 XYZ V 1 123456 XYZ 12300654321 XYZ 123 654321 XYZ 00123456 XYZ 123456 XYZ 0654321 XYZ 654321 ABC-M-0123456 ABC-M- 123456 ABCD-00654321 ABCD- 654321 00000123456 123456

Re: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions
by lodin (Hermit) on Sep 08, 2007 at 20:41 UTC

    After $bates_number =~ m{...} you can do

    my ($pfx, $num) = map substr($bates_number, $-[$_], $+[$_] - $-[$_]), grep defined $-[$_], 1 .. $#+ ;
    to get the subgroups that matched. (See perlvar for @- and @+.) This way you avoid the tricky bits about having variables in your code assertions, but more importantly: you can factor out the patterns you join together with alternation (as long as you make sure they always match with two capturing subpatterns).

    lodin

      Excellent! This is just what I was looking for: a way to obviate those nasty, repeated code assertions. I'd read a little about @+ and @- in perlvar and perlre, but I needed a concrete example that was personal to me to help make sense of them finally.

      I soon had a function based on your suggestion.

      use English qw( -no_match_vars ); ... my $bates_number_pattern = qr{ ... }x; ... sub parse_bates_number { my $bates_number = shift; $bates_number =~ $bates_number_pattern or die "Invalid Bates number: $bates_number\n"; return map { substr $bates_number, $LAST_MATCH_START[$_], $LAST_MATCH_END[$_] - $LAST_MATCH_START[$_] } grep { defined $LAST_MATCH_START[$_] } ( 1 .. $#LAST_MATCH_START ); } ... my ($prefix, $number) = parse_bates_number($bates_number);
      I chose to use English to muffle the line noise a bit. I realized I didn't need to iterate the whole series of subgroups in the regular expression, I only needed to iterate through the last matched subgroup, so I used (1..$#LAST_MATCH_START) instead of (1..$#LAST_MATCH_END).

      I tested it and it worked brilliantly. But I was bothered by the fact that I was parsing the Bates numbers twice: once with a regular expression pattern and then again with substr. The two matched substrings were already captured and stored in variables--some $m and $n from the regular expression match--and yet I was extracting them anew with a string function.

      So I tried this and it, too, worked flawlessly:

      no strict 'refs'; return map { $$_ } grep { defined $LAST_MATCH_START[$_] } ( 1 .. $#LAST_MATCH_START );
      Because $$_ is a symbolic reference, I'm forced to countermand strict 'refs', but this is a rare, legitimate use of symbolic references, don't you think?

      Here's the revised script in its entirety:

      And here's its output:

      XYZ 123 00000123 XYZ 123 123 XYZ 123 00000456 XYZ 123 456 XYZ 123 00654321 XYZ 123 654321 XYZ 12 ST 00123456 XYZ 12 ST 123456 XYZ 123 ST 00654321 XYZ 123 ST 654321 XYZ U 123 00123456 XYZ U 123 123456 XYZ U 12 00654321 XYZ U 12 654321 XYZ V 1 00123456 XYZ V 1 123456 XYZ 12300654321 XYZ 123 654321 XYZ 00123456 XYZ 123456 XYZ 0654321 XYZ 654321 ABC-M-0123456 ABC-M- 123456 ABCD-00654321 ABCD- 654321 00000123456 123456 99999999999 99999999999 Invalid Bates number: BOGUS99
      I'm not exactly sure why I used a BEGIN block. It seems right. Is it?

      Thanks again!

      Jim

        Because $$_ is a symbolic reference, I'm forced to countermand strict 'refs', but this is a rare, legitimate use of symbolic references, don't you think?

        I agree. Though I most of the time just use Symbol's qualify_to_ref, an oft-forgotten very handy routine, I think using $$_ is clearer in this simple case. But I'd try to limit the scope as much as possible, and usually that involves a do { ... } construct.

        return grep defined, map do { no strict 'refs'; $$_ }, 1 .. $#- ;
        Here I've used (as in my original reply) that map and grep can take an expression instead of a block, so note the comma after the do block. I also check the match variables instead of the indices for definedness. It just felt nice. (In the first reply I had to check the index first.)

        I'm not exactly sure why I used a BEGIN block. It seems right. Is it?

        In this case it's a matter of taste. You don't need it, but there are some possible benefits in the future. Personally I stay away from them until I need them (and I rarely do). That way I know the code needs special care when I do see them in my own code. In either case, I'd keep the curly blackets to limit the scope of $bates_number_pattern.

        lodin

Re: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions
by graff (Chancellor) on Sep 08, 2007 at 23:00 UTC
    This does the same thing as your code, though I don't know if it would extend correctly to other cases (if there are any other cases):
    #!/usr/bin/perl use strict; use warnings; $\ = "\n"; $, = "\t"; while (<DATA>) { chomp; my ( $pfx, $num ); if ( /^\d+$/ ) { $pfx = ''; ( $num = $_ ) =~ s/^0+//; } elsif ( /^\S+$/ ) { ( $pfx, $num ) = ( /(\D+)0+(\d+)/ ); } else { my $last_space = rindex( $_, ' ' ) +1; ( $pfx = substr( $_, 0, $last_space )) =~ s/\s+$//; ( $num = substr( $_, $last_space )) =~ s/^([^0]*)0+//; $pfx .= " $1" if ( length( $1 )); } print $_, $pfx, $num; } __DATA__ XYZ 123 00654321 XYZ 12 ST 00123456 XYZ 123 ST 00654321 XYZ U 123 00123456 XYZ U 12 00654321 XYZ V 1 00123456 XYZ 12300654321 XYZ 00123456 XYZ 0654321 ABC-M-0123456 ABCD-00654321 00000123456
    I like seeing the distinctions laid out in procedural conditions like this, rather than as a lengthy regex involving complex, perl-regex-specific features -- it just seems easier to read -- but that's just my personal preference.

    update: Looking at the OP again, I realize that the specificity of the various patterns in the OP code is intended as a sort of sanity check on the input (die if there are no specific matches).

    In that regard -- again, just my personal view -- it might be easier (more legible / maintainable) to apply sanity checks to the individual result strings ($pfx, $num) after they've been picked apart from the input string by the kind of generic logic I suggested here; e.g., add an if block like this just before the print statement:

    if ( $num !~ /^\d{1,7}$/ or $pfx !~ /^(?: ABC(?:D|-M)- | XYZ(?:\s[UV])? (?:\s\d{1,3} (?:\sST)? )? )$/x ) + { warn "Bad input at line $.\n"; next; }
    (Then again, that last regex addmittedly looks like the sort of thing that people usually point to as "line noise". I'm sure there are more legible ways of doing the same thing.)
Re: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions
by johngg (Canon) on Sep 08, 2007 at 22:29 UTC
    Also, why do I have to declare the variables $pfx and $num outside the while loop for this to work properly?

    I think that is because the regular expression code blocks create closures around $pfx and $num when the regex is compiled. This means that the assignments will always be to the $pfx and $num that were declared inside the while loop the first time through. I think doing

    local our ($pfx, $num);

    inside the while would work but I've not tested it.

    I hope this is of use.

    Cheers,

    JohnGG

      I tested it. It does work.

      Thanks!

      Jim