Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions

Jim has asked for the wisdom of the Perl Monks concerning the following question:

What's the best way to match against multiple, carefully-arranged alternative patterns and capture multiple matched subexpressions in a finite number of variables (in this case, two: $pfx and $num)?

(The truth is, I can't figure out how to ask the question I mean to ask. Please infer and interpret liberally and generously.)

I did this. I don't like it because it depends on esoteric regular expression stuff and because there are a bunch of repeated assignments to the same two variables. Is there a better way to accomplish the same parsing task?

$ cat parse_bates_numbers.pl
#!/usr/bin/perl
#
# parse_bates_numbers.pl

use strict;
use warnings;

$\ = "\n";
$, = "\t";

my ($pfx, $num);

while (my $bates_number = <DATA>) {
    chomp $bates_number;

    undef $pfx;
    undef $num;

    $bates_number =~ m{
        \A
        (?:
        # XYZ 999 99999999 or XYZ 99 ST 99999999 or XYZ 999 ST 9999999
+9
        (XYZ\s\d{2,3}(?:\sST)?) (?{ $pfx = $^N }) \s(\d{8})   (?{ $num
+ = $^N })
        |
        # XYZ U 999 99999999 or XYZ U 99 99999999 or XYZ V 9 99999999
        (XYZ\s[UV]\s\d{1,3})    (?{ $pfx = $^N }) \s(\d{8})   (?{ $num
+ = $^N })
        |
        # XYZ 99999999999
        (XYZ\s\d{3})            (?{ $pfx = $^N })   (\d{8})   (?{ $num
+ = $^N })
        |
        # XYZ 99999999 or XYZ 9999999
        (XYZ)                   (?{ $pfx = $^N }) \s(\d{7,8}) (?{ $num
+ = $^N })
        |
        # ABC-M-9999999
        (ABC-M-)                (?{ $pfx = $^N })   (\d{7})   (?{ $num
+ = $^N })
        |
        # ABCD-99999999
        (ABCD-)                 (?{ $pfx = $^N })   (\d{8})   (?{ $num
+ = $^N })
        |
        # 99999999999
        ()                      (?{ $pfx = $^N })   (\d{11})  (?{ $num
+ = $^N })
        )
        \z
    }x
        or die "Invalid Bates number $bates_number";

    print $bates_number, $pfx, $num + 0;
}

exit 0;

__END__
XYZ 123 00654321
XYZ 12 ST 00123456
XYZ 123 ST 00654321
XYZ U 123 00123456
XYZ U 12 00654321
XYZ V 1 00123456
XYZ 12300654321
XYZ 00123456
XYZ 0654321
ABC-M-0123456
ABCD-00654321
00000123456
$ perl ./parse_bates_numbers.pl | expand -20
XYZ 123 00654321    XYZ 123             654321
XYZ 12 ST 00123456  XYZ 12 ST           123456
XYZ 123 ST 00654321 XYZ 123 ST          654321
XYZ U 123 00123456  XYZ U 123           123456
XYZ U 12 00654321   XYZ U 12            654321
XYZ V 1 00123456    XYZ V 1             123456
XYZ 12300654321     XYZ 123             654321
XYZ 00123456        XYZ                 123456
XYZ 0654321         XYZ                 654321
ABC-M-0123456       ABC-M-              123456
ABCD-00654321       ABCD-               654321
00000123456                             123456
$
[download]

Also, why do I have to declare the variables $pfx and $num outside the while loop for this to work properly?

Jim

Comment on Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions Download Code

Replies are listed 'Best First'.
Re: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions by bobf (Monsignor) on Sep 08, 2007 at 21:02 UTC
I started thinking about using a few nested if statements to simplify the regexes used at each stage, but that looked more complicated than necessary. I settled on the following, which eliminates the code blocks (assignments to variables) in the middle of the regex (which, according to perlre, is "considered highly experimental, and may be changed or deleted without notice") and uses multiple regexes rather than a single one with alternation. IMO, eliminating the code blocks and laying out the patterns so the common parts are aligned makes them easier to read. Note that I don't use the temporary lexical variables `$pfx` and `$num` because the regex is inside an if block and the captured strings are only used in a single print statement. If you're doing more than printing, you could add them back in as shown in the comment. I also removed the die so the program would continue processing, but you should use whatever is most appropriate in your situation. use strict; use warnings; while( my $bates_number = <DATA> ) { chomp $bates_number; if( $bates_number =~ m/^( XYZ \s \d{2,3} (?:\sST)? ) \s ( \d{8} +)$/x \|\| $bates_number =~ m/^( XYZ \s [UV] \s \d{1,3} ) \s ( \d{8} +)$/x \|\| $bates_number =~ m/^( XYZ \s \d{3} ) ( \d{8} +)$/x \|\| $bates_number =~ m/^( XYZ ) \s ( \d{7,8} +)$/x \|\| $bates_number =~ m/^( ABC-M- ) ( \d{7} +)$/x \|\| $bates_number =~ m/^( ABCD- ) ( \d{8} +)$/x \|\| $bates_number =~ m/^( ) ( \d{11} +)$/x ) { # my ( $pfx, $num ) = ( $1, $2 ); # could assign here printf( "%-20s %-12s % d\n", $bates_number, $1, $2 ); } else { print "Invalid Bates number: $bates_number\n"; } } __DATA__ XYZ 123 00654321 XYZ 12 ST 00123456 XYZ 123 ST 00654321 XYZ U 123 00123456 XYZ U 12 00654321 XYZ V 1 00123456 XYZ 12300654321 XYZ 00123456 XYZ 0654321 ABC-M-0123456 ABCD-00654321 00000123456 [download] Output: `XYZ 123 00654321 XYZ 123 654321 XYZ 12 ST 00123456 XYZ 12 ST 123456 XYZ 123 ST 00654321 XYZ 123 ST 654321 XYZ U 123 00123456 XYZ U 123 123456 XYZ U 12 00654321 XYZ U 12 654321 XYZ V 1 00123456 XYZ V 1 123456 XYZ 12300654321 XYZ 123 654321 XYZ 00123456 XYZ 123456 XYZ 0654321 XYZ 654321 ABC-M-0123456 ABC-M- 123456 ABCD-00654321 ABCD- 654321 00000123456 123456` [download]	[reply] [d/l] [select]
Re: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions by lodin (Hermit) on Sep 08, 2007 at 20:41 UTC
After `$bates_number =~ m{...}` you can do `my ($pfx, $num) = map substr($bates_number, $-[$_], $+[$_] - $-[$_]), grep defined $-[$_], 1 .. $#+ ;` [download] to get the subgroups that matched. (See perlvar for `@-` and `@+`.) This way you avoid the tricky bits about having variables in your code assertions, but more importantly: you can factor out the patterns you join together with alternation (as long as you make sure they always match with two capturing subpatterns). lodin	[reply] [d/l] [select]
Re^2: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions by Jim (Curate) on Sep 09, 2007 at 22:45 UTC
Excellent! This is just what I was looking for: a way to obviate those nasty, repeated code assertions. I'd read a little about @+ and @- in perlvar and perlre, but I needed a concrete example that was personal to me to help make sense of them finally. I soon had a function based on your suggestion. `use English qw( -no_match_vars ); ... my $bates_number_pattern = qr{ ... }x; ... sub parse_bates_number { my $bates_number = shift; $bates_number =~ $bates_number_pattern or die "Invalid Bates number: $bates_number\n"; return map { substr $bates_number, $LAST_MATCH_START[$_], $LAST_MATCH_END[$_] - $LAST_MATCH_START[$_] } grep { defined $LAST_MATCH_START[$_] } ( 1 .. $#LAST_MATCH_START ); } ... my ($prefix, $number) = parse_bates_number($bates_number);` [download] I chose to `use English` to muffle the line noise a bit. I realized I didn't need to iterate the whole series of subgroups in the regular expression, I only needed to iterate through the last matched subgroup, so I used `(1..$#LAST_MATCH_START)` instead of `(1..$#LAST_MATCH_END)`. I tested it and it worked brilliantly. But I was bothered by the fact that I was parsing the Bates numbers twice: once with a regular expression pattern and then again with substr. The two matched substrings were already captured and stored in variables--some `$m` and `$n` from the regular expression match--and yet I was extracting them anew with a string function. So I tried this and it, too, worked flawlessly: `no strict 'refs'; return map { $$_ } grep { defined $LAST_MATCH_START[$_] } ( 1 .. $#LAST_MATCH_START );` [download] Because `$$_` is a symbolic reference, I'm forced to countermand `strict 'refs'`, but this is a rare, legitimate use of symbolic references, don't you think? Here's the revised script in its entirety: Read more... (2 kB) And here's its output: `XYZ 123 00000123 XYZ 123 123 XYZ 123 00000456 XYZ 123 456 XYZ 123 00654321 XYZ 123 654321 XYZ 12 ST 00123456 XYZ 12 ST 123456 XYZ 123 ST 00654321 XYZ 123 ST 654321 XYZ U 123 00123456 XYZ U 123 123456 XYZ U 12 00654321 XYZ U 12 654321 XYZ V 1 00123456 XYZ V 1 123456 XYZ 12300654321 XYZ 123 654321 XYZ 00123456 XYZ 123456 XYZ 0654321 XYZ 654321 ABC-M-0123456 ABC-M- 123456 ABCD-00654321 ABCD- 654321 00000123456 123456 99999999999 99999999999 Invalid Bates number: BOGUS99` [download] I'm not exactly sure why I used a `BEGIN` block. It seems right. Is it? Thanks again! Jim	[reply] [d/l] [select]
Re^3: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions by lodin (Hermit) on Sep 10, 2007 at 13:46 UTC
Because `$$_` is a symbolic reference, I'm forced to countermand `strict 'refs'`, but this is a rare, legitimate use of symbolic references, don't you think? I agree. Though I most of the time just use `Symbol`'s `qualify_to_ref`, an oft-forgotten very handy routine, I think using `$$_` is clearer in this simple case. But I'd try to limit the scope as much as possible, and usually that involves a `do { ... }` construct. `return grep defined, map do { no strict 'refs'; $$_ }, 1 .. $#- ;` [download] Here I've used (as in my original reply) that `map` and `grep` can take an expression instead of a block, so note the comma after the `do` block. I also check the match variables instead of the indices for definedness. It just felt nice. (In the first reply I had to check the index first.) I'm not exactly sure why I used a `BEGIN` block. It seems right. Is it? In this case it's a matter of taste. You don't need it, but there are some possible benefits in the future. Personally I stay away from them until I need them (and I rarely do). That way I know the code needs special care when I do see them in my own code. In either case, I'd keep the curly blackets to limit the scope of `$bates_number_pattern`. lodin	[reply] [d/l] [select]
Re: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions by graff (Chancellor) on Sep 08, 2007 at 23:00 UTC
This does the same thing as your code, though I don't know if it would extend correctly to other cases (if there are any other cases): #!/usr/bin/perl use strict; use warnings; $\ = "\n"; $, = "\t"; while (<DATA>) { chomp; my ( $pfx, $num ); if ( /^\d+$/ ) { $pfx = ''; ( $num = $_ ) =~ s/^0+//; } elsif ( /^\S+$/ ) { ( $pfx, $num ) = ( /(\D+)0+(\d+)/ ); } else { my $last_space = rindex( $_, ' ' ) +1; ( $pfx = substr( $_, 0, $last_space )) =~ s/\s+$//; ( $num = substr( $_, $last_space )) =~ s/^([^0])0+//; $pfx .= " $1" if ( length( $1 )); } print $_, $pfx, $num; } __DATA__ XYZ 123 00654321 XYZ 12 ST 00123456 XYZ 123 ST 00654321 XYZ U 123 00123456 XYZ U 12 00654321 XYZ V 1 00123456 XYZ 12300654321 XYZ 00123456 XYZ 0654321 ABC-M-0123456 ABCD-00654321 00000123456 [download] I like seeing the distinctions laid out in procedural conditions like this, rather than as a lengthy regex involving complex, perl-regex-specific features -- it just seems easier to read -- but that's just my personal preference. update:* Looking at the OP again, I realize that the specificity of the various patterns in the OP code is intended as a sort of sanity check on the input (die if there are no specific matches). In that regard -- again, just my personal view -- it might be easier (more legible / maintainable) to apply sanity checks to the individual result strings ($pfx, $num) after they've been picked apart from the input string by the kind of generic logic I suggested here; e.g., add an if block like this just before the print statement: `if ( $num !~ /^\d{1,7}$/ or $pfx !~ /^(?: ABC(?:D\|-M)- \| XYZ(?:\s[UV])? (?:\s\d{1,3} (?:\sST)? )? )$/x ) + { warn "Bad input at line $.\n"; next; }` [download] (Then again, that last regex addmittedly looks like the sort of thing that people usually point to as "line noise". I'm sure there are more legible ways of doing the same thing.)	[reply] [d/l] [select]
Re: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions by johngg (Canon) on Sep 08, 2007 at 22:29 UTC
Also, why do I have to declare the variables $pfx and $num outside the while loop for this to work properly? I think that is because the regular expression code blocks create closures around `$pfx` and `$num` when the regex is compiled. This means that the assignments will always be to the `$pfx` and `$num` that were declared inside the while loop the first time through. I think doing `local our ($pfx, $num);` inside the while would work but I've not tested it. I hope this is of use. Cheers, JohnGG	[reply] [d/l] [select]
Re^2: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions by Jim (Curate) on Sep 09, 2007 at 22:51 UTC
I tested it. It does work. Thanks! Jim	[reply]