repeated regex capture

twilliamsark has asked for the wisdom of the Perl Monks concerning the following question:

I have a source string of tag/value pairs, where each tag is represented by the character "A" and all text that is not "A" is considered to be the value associated with the preceding tag.

The problem is to break up the source string into substrings, one per tag/value pair.

So a source string of "A2ABA45" needs to be broken down to:

A2
AB
A45

Here's my first attempt:

@parts=split(/(A.+?)(?=A|$)/,"A2ABA45");
$i=1;
map {print $i++ . " -> $_\n"} @parts;

produces:

1 -> 
2 -> A2
3 -> 
4 -> AB
5 ->
6 -> A45

To get what I want I ended up with:

@parts=grep{$_}split(/(A.+?)(?=A|$)/,"A2ABA45");
$i=1;
map {print $i++ . " -> $_\n"} @parts;

which produces:

1 -> A2
2 -> AB
3 -> A45

The questions are:
1) Why are the undef entries there to begin with?
2) is there a better way?

Re: repeated regex capture by kennethk (Abbot) on Apr 07, 2011 at 19:19 UTC
You are getting empty strings (not undefs) because of your capturing parentheses. As it says in split: "If the PATTERN contains parentheses, additional list elements are created from each matching substring in the delimiter." The way you've coded it, your split fields are actually `''` and your delimiters are the elements you are interested in.

Better is subjective, but I might split at every position that is followed by an `A`, using a zero-width lookahead so that nothing is consumed: `@parts=split(/(?=A)/,"A2ABA45");`

You could also use a regular expression with the `g` modifier in list context and no capturing: `@parts = "A2ABA45" =~ /A[^A]*/g;` That is less obvious to the neophyte, though. See Global matching.
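To see the three behaviours described above side by side, here is a minimal sketch using the string from the question. The variable names (`@captured`, `@lookahead`, `@matched`) are only illustrative and do not come from the original posts.

```perl
use strict;
use warnings;

my $str = 'A2ABA45';

# Capturing split: the pattern matches the wanted text itself, so the
# "fields" between matches are empty strings and the captures are
# interleaved with them.
my @captured = split /(A.+?)(?=A|$)/, $str;

# Lookahead split: a zero-width match before each "A"; nothing is consumed.
my @lookahead = split /(?=A)/, $str;

# Global match in list context: returns every match of the pattern.
my @matched = $str =~ /A[^A]*/g;

print join(',', map { "'$_'" } @captured), "\n";   # '','A2','','AB','','A45'
print join(',', @lookahead), "\n";                  # A2,AB,A45
print join(',', @matched), "\n";                    # A2,AB,A45
```

Note that `A.+?` requires at least one character after the tag, while `A[^A]*` also accepts a tag with an empty value, so the two patterns are not interchangeable for input such as "AA".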
Re: repeated regex capture by wind (Priest) on Apr 07, 2011 at 19:15 UTC
First, the simple way to get what you want: `my $str = 'A2ABA45'; my @parts = $str =~ /A[^A]*/g;` or `my @parts = split /(?=A)/, $str;`

To explain why you have the empty entries in your first attempt: `use Data::Dumper; my @parts = split /(A.+?)(?=A|$)/, "A2ABA45"; print Dumper(\@parts);`

You're treating your wanted text as delimiters, which are captured and returned by the split. Just look at this example: `my @parts = split /(-)/, "-1-2-"; print Dumper(\@parts);`
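For anyone who does not want to run the Dumper example, the following sketch contrasts the same split with and without the capturing parentheses. The names `@plain` and `@with_delims` are made up for illustration.

```perl
use strict;
use warnings;

my $str = '-1-2-';

# Without captures only the fields between delimiters are returned;
# the leading empty field is kept, the trailing empty field is dropped.
my @plain = split /-/, $str;

# With captures each matched delimiter is returned as an extra element
# right after the field that precedes it.
my @with_delims = split /(-)/, $str;

print scalar(@plain), ": ", join(',', map { "'$_'" } @plain), "\n";
# 3: '','1','2'
print scalar(@with_delims), ": ", join(',', map { "'$_'" } @with_delims), "\n";
# 6: '','-','1','-','2','-'
```

In the original question the roles are reversed: the captured "delimiters" are the wanted text and the real fields are all empty, which is why `grep {$_}` happened to clean things up.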
Re: repeated regex capture by locked_user sundialsvc4 (Abbot) on Apr 07, 2011 at 23:22 UTC
The `c` modifier can also be useful in cases like this. From `perldoc perlretut` (not the first place you’d probably look to find it...):

The final two modifiers "//g" and "//c" concern multiple matches. The modifier "//g" stands for global matching and allows the matching operator to match within a string as many times as possible. In scalar context, successive invocations against a string will have "//g" jump from match to match, keeping track of position in the string as it goes along. You can get or set the position with the "pos()" function. ... A failed match or changing the target string resets the position. If you don’t want the position reset after failure to match, add the "//c", as in "/regexp/gc". The current position in the string is associated with the string, not the regexp. This means that different strings have different positions and their respective positions can be set or read independently.

The bottom line is that these modifiers allow you to “walk through” a string. One thing that you need to be very mindful of, however, is that you must specify undef, not zero, to reset the position. From `perldoc -f pos`:

Returns the offset of where the last "m//g" search left off for the variable in question ($_ is used when the variable is not specified). Note that 0 is a valid match offset. "undef" indicates that the search position is reset (usually due to match failure, but can also be because no match has yet been performed on the scalar). "pos" directly accesses the location used by the regexp engine to store the offset, so assigning to "pos" will change that offset, and so will also influence the "\G" zero‐width assertion in regular expressions. Because a failed "m//gc" match doesn’t reset the offset, the return from "pos" won’t change either in this case.
(I belabor this point, hoping sincerely that I do not offend anyone by doing so, because there is a rather large lump on my forehead (and an equal-sized dent in the wall) from not carefully reading the boldfaced text! Incorrectly setting the position to zero, not `undef`, was a nass-s-s-s-s-s-ty bug that escaped detection for a long time.)
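As a rough illustration of the "walk through" approach applied to the string from the question, here is a small sketch. It assumes the same `A[^A]*` pattern suggested in the other replies; the variable names are invented for the example.

```perl
use strict;
use warnings;

my $str = 'A2ABA45';
my @parts;

# In scalar context //g resumes from pos($str); \G anchors each
# match at that position, so no text can be skipped silently.
while ($str =~ /\G(A[^A]*)/gc) {
    push @parts, $1;
}

# The loop ended on a failed match, but /c preserved the position,
# which is now the end of the string.
my $after_walk = pos($str);

# Explicit reset by assigning undef, as the quoted pos documentation describes.
pos($str) = undef;
my $after_reset = defined pos($str) ? pos($str) : 'undef';

print "@parts\n";                    # A2 AB A45
print "$after_walk $after_reset\n";  # 7 undef
```

Without the `c` modifier the failed final match would already have reset the position, so `pos($str)` would be undef right after the loop.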