Re: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions

Re^2: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions by Jim (Curate) on Sep 09, 2007 at 22:45 UTC
Excellent! This is just what I was looking for: a way to obviate those nasty, repeated code assertions. I'd read a little about `@+` and `@-` in perlvar and perlre, but I needed a concrete example that was personal to me to finally make sense of them. I soon had a function based on your suggestion.

`use English qw( -no_match_vars ); ... my $bates_number_pattern = qr{ ... }x; ... sub parse_bates_number { my $bates_number = shift; $bates_number =~ $bates_number_pattern or die "Invalid Bates number: $bates_number\n"; return map { substr $bates_number, $LAST_MATCH_START[$_], $LAST_MATCH_END[$_] - $LAST_MATCH_START[$_] } grep { defined $LAST_MATCH_START[$_] } ( 1 .. $#LAST_MATCH_START ); } ... my ($prefix, $number) = parse_bates_number($bates_number);`

I chose to `use English` to muffle the line noise a bit. I realized I didn't need to iterate over every subgroup in the regular expression; I only needed to iterate up to the last matched subgroup, so I used `(1..$#LAST_MATCH_START)` instead of `(1..$#LAST_MATCH_END)`. I tested it and it worked brilliantly.

But I was bothered by the fact that I was parsing the Bates numbers twice: once with a regular expression pattern and then again with `substr`. The two matched substrings were already captured and stored in variables (some `$m` and `$n` from the regular expression match), and yet I was extracting them anew with a string function. So I tried this and it, too, worked flawlessly:

`no strict 'refs'; return map { $$_ } grep { defined $LAST_MATCH_START[$_] } ( 1 .. $#LAST_MATCH_START );`

Because `$$_` is a symbolic reference, I'm forced to countermand `strict 'refs'`, but this is a rare, legitimate use of symbolic references, don't you think?

Here's the revised script in its entirety: [script collapsed in the original post, 2 kB]

And here's its output:

`XYZ 123 00000123 XYZ 123 123 XYZ 123 00000456 XYZ 123 456 XYZ 123 00654321 XYZ 123 654321 XYZ 12 ST 00123456 XYZ 12 ST 123456 XYZ 123 ST 00654321 XYZ 123 ST 654321 XYZ U 123 00123456 XYZ U 123 123456 XYZ U 12 00654321 XYZ U 12 654321 XYZ V 1 00123456 XYZ V 1 123456 XYZ 12300654321 XYZ 123 654321 XYZ 00123456 XYZ 123456 XYZ 0654321 XYZ 654321 ABC-M-0123456 ABC-M- 123456 ABCD-00654321 ABCD- 654321 00000123456 123456 99999999999 99999999999 Invalid Bates number: BOGUS99`

I'm not exactly sure why I used a `BEGIN` block. It seems right. Is it?

Thanks again!

Jim
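For readers following along, here is a minimal, self-contained sketch of the `@-`/`@+` technique described in the post above. The real Bates number pattern is elided in the post, so the pattern below is a hypothetical stand-in with two alternatives, and the names `$pattern` and `captures` are made up for illustration. It uses the punctuation variables directly rather than their `English` names.

```perl
use strict;
use warnings;

# Hypothetical pattern (NOT the elided Bates number pattern): two
# alternatives, each with its own pair of capture groups.
my $pattern = qr{
    \A (?: ([A-Z]+) - (\d+)     # groups 1 and 2, e.g. ABC-123
         | (\d+) / ([a-z]+)     # groups 3 and 4, e.g. 123/abc
       ) \z
}x;

# Return only the capture groups that took part in the match.
sub captures {
    my ($string) = @_;
    $string =~ $pattern or return;

    # $#- is the index of the last group that matched. Groups below it
    # that did not take part in the match have undef offsets in @- and @+.
    return map  { substr( $string, $-[$_], $+[$_] - $-[$_] ) }
           grep { defined $-[$_] }
           1 .. $#-;
}

my @first  = captures('ABC-123');
my @second = captures('123/abc');

print "@first\n";     # prints "ABC 123"
print "@second\n";    # prints "123 abc"
```

Whichever alternative matches, the caller gets back just the two substrings that were captured, without having to know which group numbers they came from.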
Re^3: Matching Multiple Alternative Patterns and Capturing Multiple Subexpressions by lodin (Hermit) on Sep 10, 2007 at 13:46 UTC
"Because `$$_` is a symbolic reference, I'm forced to countermand `strict 'refs'`, but this is a rare, legitimate use of symbolic references, don't you think?"

I agree. Though most of the time I just use `Symbol`'s `qualify_to_ref`, an oft-forgotten, very handy routine, I think using `$$_` is clearer in this simple case. But I'd try to limit the scope as much as possible, and usually that involves a `do { ... }` construct.

`return grep defined, map do { no strict 'refs'; $$_ }, 1 .. $#- ;`

Here I've relied (as in my original reply) on the fact that `map` and `grep` can take an expression instead of a block, so note the comma after the `do` block. I also check the match variables themselves, rather than the indices, for definedness. It just felt nice. (In the first reply I had to check the index first.)

"I'm not exactly sure why I used a `BEGIN` block. It seems right. Is it?"

In this case it's a matter of taste. You don't need it, but there are some possible benefits in the future. Personally I stay away from `BEGIN` blocks until I need them (and I rarely do). That way, when I do see one in my own code, I know that code needs special care. Either way, I'd keep the curly brackets to limit the scope of `$bates_number_pattern`.

lodin
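A small sketch of the scoped `no strict 'refs'` idea from this reply, for anyone who wants to try it in isolation. The pattern and the sub name `defined_captures` are hypothetical (the thread's actual pattern is elided); only the `return` line is taken from the reply.

```perl
use strict;
use warnings;

# Hypothetical pattern for illustration only: two alternatives,
# two capture groups each.
my $pattern = qr{ \A (?: ([A-Z]+) - (\d+) | (\d+) / ([a-z]+) ) \z }x;

sub defined_captures {
    my ($string) = @_;
    $string =~ $pattern or return;

    # strict 'refs' is relaxed only inside the do block, where $$_
    # resolves symbolically to $1, $2, ... The values themselves are
    # then filtered for definedness, so no offsets are consulted.
    return grep defined, map do { no strict 'refs'; $$_ }, 1 .. $#-;
}

my @left  = defined_captures('ABC-123');
my @right = defined_captures('123/abc');

print scalar(@left), " captures: @left\n";
print scalar(@right), " captures: @right\n";
```

Outside the `do` block, `strict 'refs'` stays in force, so an accidental symbolic reference elsewhere in the sub would still be caught at run time.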