I soon had a function based on your suggestion.
I chose to use English to muffle the line noise a bit. I realized I didn't need to iterate the whole series of subgroups in the regular expression, I only needed to iterate through the last matched subgroup, so I used (1..$#LAST_MATCH_START) instead of (1..$#LAST_MATCH_END).use English qw( -no_match_vars ); ... my $bates_number_pattern = qr{ ... }x; ... sub parse_bates_number { my $bates_number = shift; $bates_number =~ $bates_number_pattern or die "Invalid Bates number: $bates_number\n"; return map { substr $bates_number, $LAST_MATCH_START[$_], $LAST_MATCH_END[$_] - $LAST_MATCH_START[$_] } grep { defined $LAST_MATCH_START[$_] } ( 1 .. $#LAST_MATCH_START ); } ... my ($prefix, $number) = parse_bates_number($bates_number);
I tested it and it worked brilliantly. But I was bothered by the fact that I was parsing the Bates numbers twice: once with a regular expression pattern and then again with substr. The two matched substrings were already captured and stored in variables--some $m and $n from the regular expression match--and yet I was extracting them anew with a string function.
So I tried this and it, too, worked flawlessly:
Because $$_ is a symbolic reference, I'm forced to countermand strict 'refs', but this is a rare, legitimate use of symbolic references, don't you think?no strict 'refs'; return map { $$_ } grep { defined $LAST_MATCH_START[$_] } ( 1 .. $#LAST_MATCH_START );
Here's the revised script in its entirety:
#!/usr/bin/perl # # parse_bates_numbers_v2.pl use strict; use warnings; use English qw( -no_match_vars ); BEGIN { my $bates_number_pattern = qr{ ^ # ( Prefix ) ( Number ) (?: ( XYZ\s\d{2,3}(?:\sST)? ) \s ( \d{8} ) | ( XYZ\s[UV]\s\d{1,3} ) \s ( \d{8} ) | ( XYZ\s\d{3} ) ( \d{8} ) | ( XYZ ) \s ( \d{7,8} ) | ( ABC-M- ) ( \d{7} ) | ( ABCD- ) ( \d{8} ) | ( ) ( \d{11} ) ) $ }x; '12345678901' =~ $bates_number_pattern or die "Invalid Bates number pattern:\n$bates_number_pattern\n +"; $#LAST_MATCH_END % 2 == 0 or die "Invalid number of parentheses in pattern: $#LAST_MATCH +_END\n"; sub parse_bates_number { my $bates_number = shift; $bates_number =~ $bates_number_pattern or die "Invalid Bates number: $bates_number\n"; no strict 'refs'; return map { $$_ } grep { defined $LAST_MATCH_START[$_] } ( 1 .. $#LAST_MATCH_START ); } } while (my $bates_number = <DATA>) { chomp $bates_number; my ($prefix, $number) = parse_bates_number($bates_number); printf "%-20s %-10s %12.0f\n", $bates_number, $prefix, $number; } exit 0; __END__ XYZ 123 00000123 XYZ 123 00000456 XYZ 123 00654321 XYZ 12 ST 00123456 XYZ 123 ST 00654321 XYZ U 123 00123456 XYZ U 12 00654321 XYZ V 1 00123456 XYZ 12300654321 XYZ 00123456 XYZ 0654321 ABC-M-0123456 ABCD-00654321 00000123456 99999999999 BOGUS99
And here's its output:
I'm not exactly sure why I used a BEGIN block. It seems right. Is it?XYZ 123 00000123 XYZ 123 123 XYZ 123 00000456 XYZ 123 456 XYZ 123 00654321 XYZ 123 654321 XYZ 12 ST 00123456 XYZ 12 ST 123456 XYZ 123 ST 00654321 XYZ 123 ST 654321 XYZ U 123 00123456 XYZ U 123 123456 XYZ U 12 00654321 XYZ U 12 654321 XYZ V 1 00123456 XYZ V 1 123456 XYZ 12300654321 XYZ 123 654321 XYZ 00123456 XYZ 123456 XYZ 0654321 XYZ 654321 ABC-M-0123456 ABC-M- 123456 ABCD-00654321 ABCD- 654321 00000123456 123456 99999999999 99999999999 Invalid Bates number: BOGUS99
Thanks again!
Jim
In reply to Re^2: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions
by Jim
in thread Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions
by Jim
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |