comment on

Excellent! This is just what I was looking for: a way to obviate those nasty, repeated code assertions. I'd read a little about @+ and @- in perlvar and perlre, but I needed a concrete example that was personal to me to help make sense of them finally.

I soon had a function based on your suggestion.

use English qw( -no_match_vars );
...
my $bates_number_pattern = qr{ ... }x;
...
sub parse_bates_number {
    my $bates_number = shift;

    $bates_number =~ $bates_number_pattern
        or die "Invalid Bates number: $bates_number\n";

    return map  { substr $bates_number,
                         $LAST_MATCH_START[$_],
                         $LAST_MATCH_END[$_] - $LAST_MATCH_START[$_] }
           grep { defined $LAST_MATCH_START[$_] }
                ( 1 .. $#LAST_MATCH_START );
}
...
    my ($prefix, $number) = parse_bates_number($bates_number);
[download]

I chose to use English to muffle the line noise a bit. I realized I didn't need to iterate the whole series of subgroups in the regular expression, I only needed to iterate through the last matched subgroup, so I used (1..$#LAST_MATCH_START) instead of (1..$#LAST_MATCH_END).

I tested it and it worked brilliantly. But I was bothered by the fact that I was parsing the Bates numbers twice: once with a regular expression pattern and then again with substr. The two matched substrings were already captured and stored in variables--some $m and $n from the regular expression match--and yet I was extracting them anew with a string function.

So I tried this and it, too, worked flawlessly:

    no strict 'refs';

    return map  { $$_ }
           grep { defined $LAST_MATCH_START[$_] }
                ( 1 ..  $#LAST_MATCH_START );
[download]

Because $$_ is a symbolic reference, I'm forced to countermand strict 'refs', but this is a rare, legitimate use of symbolic references, don't you think?

Here's the revised script in its entirety:

#!/usr/bin/perl
#
# parse_bates_numbers_v2.pl

use strict;
use warnings;
use English qw( -no_match_vars );

BEGIN {
    my $bates_number_pattern = qr{
        ^
            #   ( Prefix                )    ( Number  )
            (?:
                ( XYZ\s\d{2,3}(?:\sST)? ) \s ( \d{8}   )
            |
                ( XYZ\s[UV]\s\d{1,3}    ) \s ( \d{8}   )
            |
                ( XYZ\s\d{3}            )    ( \d{8}   )
            |
                ( XYZ                   ) \s ( \d{7,8} )
            |
                ( ABC-M-                )    ( \d{7}   )
            |
                ( ABCD-                 )    ( \d{8}   )
            |
                (                       )    ( \d{11}  )
            )
        $
    }x;

    '12345678901' =~ $bates_number_pattern
        or die "Invalid Bates number pattern:\n$bates_number_pattern\n
+";
    
    $#LAST_MATCH_END % 2 == 0
        or die "Invalid number of parentheses in pattern: $#LAST_MATCH
+_END\n";

    sub parse_bates_number {
        my $bates_number = shift;

        $bates_number =~ $bates_number_pattern
            or die "Invalid Bates number: $bates_number\n";

        no strict 'refs';

        return map  { $$_ }
               grep { defined $LAST_MATCH_START[$_] }
                    ( 1 ..  $#LAST_MATCH_START );
    }
}

while (my $bates_number = <DATA>) {
    chomp $bates_number;

    my ($prefix, $number) = parse_bates_number($bates_number);

    printf "%-20s %-10s %12.0f\n", $bates_number, $prefix, $number;
}

exit 0;

__END__
XYZ 123 00000123
XYZ 123 00000456
XYZ 123 00654321
XYZ 12 ST 00123456
XYZ 123 ST 00654321
XYZ U 123 00123456
XYZ U 12 00654321
XYZ V 1 00123456
XYZ 12300654321
XYZ 00123456
XYZ 0654321
ABC-M-0123456
ABCD-00654321
00000123456
99999999999
BOGUS99
[download]

And here's its output:

XYZ 123 00000123     XYZ 123             123
XYZ 123 00000456     XYZ 123             456
XYZ 123 00654321     XYZ 123          654321
XYZ 12 ST 00123456   XYZ 12 ST        123456
XYZ 123 ST 00654321  XYZ 123 ST       654321
XYZ U 123 00123456   XYZ U 123        123456
XYZ U 12 00654321    XYZ U 12         654321
XYZ V 1 00123456     XYZ V 1          123456
XYZ 12300654321      XYZ 123          654321
XYZ 00123456         XYZ              123456
XYZ 0654321          XYZ              654321
ABC-M-0123456        ABC-M-           123456
ABCD-00654321        ABCD-            654321
00000123456                           123456
99999999999                      99999999999
Invalid Bates number: BOGUS99
[download]

I'm not exactly sure why I used a BEGIN block. It seems right. Is it?

Thanks again!

Jim

In reply to Re^2: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions by Jim
in thread Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions by Jim

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.