twilliamsark has asked for the wisdom of the Perl Monks concerning the following question:

Given a source string of tag/value pairs, where each tag is the character "A" and all text that is not "A" is the value associated with the preceding tag.

The problem is to break the source string up into TagValue substrings.

So a source string of "A2ABA45" needs to be broken down to:

A2 AB A45

Here's my first attempt:

@parts = split(/(A.+?)(?=A|$)/, "A2ABA45");
$i = 1;
map { print $i++ . " -> $_\n" } @parts;

produces:

1 -> 
2 -> A2
3 -> 
4 -> AB
5 -> 
6 -> A45

To get what I want I ended up with:

@parts = grep { $_ } split(/(A.+?)(?=A|$)/, "A2ABA45");
$i = 1;
map { print $i++ . " -> $_\n" } @parts;

which produces:

1 -> A2
2 -> AB
3 -> A45

The questions are:
1) Why are the undef entries there to begin with?
2) Is there a better way?

Replies are listed 'Best First'.
Re: repeated regex capture
by kennethk (Abbot) on Apr 07, 2011 at 19:19 UTC
    You are getting empty strings (not undefs) because of your capturing parentheses. As it says in split:
    If the PATTERN contains parentheses, additional list elements are created from each matching substring in the delimiter.

    From how you've coded that, your split values are actually '' and your delimiters are the elements you are interested in.
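    To make the empty-string point concrete, here is a small sketch using the pattern from the question. The variable names (@undefined, @empty, @kept) are only illustrative. It also filters with length rather than truthiness, on the assumption that a value such as "0" should survive the filter:

```perl
use strict;
use warnings;

my @parts = split /(A.+?)(?=A|$)/, "A2ABA45";

# Every element is defined; the "fields" between delimiters are empty strings.
my @undefined = grep { !defined($_) } @parts;
my @empty     = grep { defined($_) && $_ eq '' } @parts;

printf "%d elements, %d undef, %d empty\n",
    scalar @parts, scalar @undefined, scalar @empty;

# Filtering on length keeps a value such as "0", which grep { $_ } would drop.
my @kept = grep { length($_) } @parts;
print "@kept\n";
```

    The captured delimiters are the only elements that carry text here, which is why the filter recovers the wanted list.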

    Better is subjective, but I might split at each position that is followed by an A, using a zero-width lookahead:

    @parts=split(/(?=A)/,"A2ABA45");

    You could also use a regular expression with the g modifier in list context and no capturing:

    @parts = "A2ABA45" =~ /A[^A]*/g;

    That is less obvious to the neophyte, though. See Global matching.
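    A quick side-by-side of the two suggestions, as a sketch. The second input, "xxA2AB", is a made-up string added only to show where the approaches diverge: split keeps any text before the first tag, while the global match skips it.

```perl
use strict;
use warnings;

my $str = "A2ABA45";

my @by_split = split /(?=A)/, $str;   # cut before each A
my @by_match = $str =~ /A[^A]*/g;     # collect each A plus its value

print "split: @by_split\n";
print "match: @by_match\n";

# Hypothetical input with stray text before the first tag.
my $messy       = "xxA2AB";
my @split_messy = split /(?=A)/, $messy;
my @match_messy = $messy =~ /A[^A]*/g;

print "split: @split_messy\n";   # keeps the leading "xx"
print "match: @match_messy\n";   # only the tagged pieces
```

    Which behaviour is preferable depends on whether untagged leading text should be reported or ignored.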

Re: repeated regex capture
by wind (Priest) on Apr 07, 2011 at 19:15 UTC
    First, the simple way to get what you want:
    my $str = 'A2ABA45';
    my @parts = $str =~ /A[^A]*/g;
    or
    my @parts = split /(?=A)/, $str;
    To explain why you have empty strings (they are not actually undefs) in your first attempt:
    use Data::Dumper;
    my @parts = split /(A.+?)(?=A|$)/, "A2ABA45";
    print Dumper(\@parts);
    You're treating your wanted text as delimiters that are captured in the split. Just look at this example.
    my @parts = split /(-)/, "-1-2-";
    print Dumper(\@parts);
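
    For anyone who wants to see which elements are fields and which are captured delimiters, the following sketch labels each element of that last example by its index. The field/delimiter labelling relies on the pattern having exactly one capture group:

```perl
use strict;
use warnings;

my @parts = split /(-)/, "-1-2-";

# With one capture group, even indexes hold fields and odd indexes hold
# the captured delimiters. split drops the empty field after the last "-".
for my $i (0 .. $#parts) {
    my $kind = $i % 2 ? 'delimiter' : 'field';
    printf "%d %-9s '%s'\n", $i, $kind, $parts[$i];
}
```

    The first field is the empty string before the leading "-", which mirrors the empty strings in the original attempt.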
Re: repeated regex capture
by locked_user sundialsvc4 (Abbot) on Apr 07, 2011 at 23:22 UTC

    The “c” modifier can also be useful in cases like this. From perldoc perlretut (probably not the first place you’d look to find it):

    The final two modifiers "//g" and "//c" concern multiple matches. The modifier "//g" stands for global matching and allows the matching operator to match within a string as many times as possible. In scalar context, successive invocations against a string will have "//g" jump from match to match, keeping track of position in the string as it goes along. You can get or set the position with the "pos()" function.
    ...
    A failed match or changing the target string resets the position. If you don’t want the position reset after failure to match, add the "//c", as in "/regexp/gc". The current position in the string is associated with the string, not the regexp. This means that different strings have different positions and their respective positions can be set or read independently.
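
    As a rough sketch of the scalar-context behaviour described in that excerpt, applied to the string from the question (the loop and variable names are mine, not from perlretut):

```perl
use strict;
use warnings;

my $str = "A2ABA45";
my @parts;

# In scalar context each //g match resumes where the previous one stopped.
while ($str =~ /(A[^A]*)/g) {
    push @parts, $1;
    printf "matched '%s', pos is now %d\n", $1, pos($str);
}

# The loop ended on a failed match without /c, so the position was reset.
print defined(pos($str)) ? "pos is still set\n" : "pos was reset\n";
```

    Each pass through the loop picks up one TagValue piece, and pos() reports the offset just past it.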

    The bottom line is that these modifiers allow you to “walk through” a string.   One thing that you need to be very mindful of, however, is that you must specify undef, not zero, to reset the position.   From perldoc -f pos:

    Returns the offset of where the last "m//g" search left off for the variable in question ($_ is used when the variable is not specified). Note that 0 is a valid match offset. "undef" indicates that the search position is reset (usually due to match failure, but can also be because no match has yet been performed on the scalar). "pos" directly accesses the location used by the regexp engine to store the offset, so assigning to "pos" will change that offset, and so will also influence the "\G" zero‐width assertion in regular expressions. Because a failed "m//gc" match doesn’t reset the offset, the return from "pos" won’t change either in this case.
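
    A minimal sketch of the pos behaviour quoted above, again using the string from the question. The pattern /\GZ/ is just a deliberately failing match chosen for illustration, and the // operator assumes Perl 5.10 or later:

```perl
use strict;
use warnings;

my $str = "A2ABA45";

$str =~ /A[^A]*/g;                  # matches "A2"
my $after_first = pos($str);        # offset just past "A2"

$str =~ /\GZ/gc;                    # fails, but /c keeps the position
my $after_failed_gc = pos($str);

$str =~ /\GZ/g;                     # fails without /c, so the position is reset
my $after_failed_g = pos($str);

pos($str) = 4;                      # assigning to pos moves the \G anchor
my $tag = $str =~ /\G(A[^A]*)/g ? $1 : undef;

pos($str) = undef;                  # explicit reset
my $after_reset = pos($str);

printf "first: %s, failed /gc: %s, failed /g: %s, tag: %s, reset: %s\n",
    map { $_ // 'undef' }
    $after_first, $after_failed_gc, $after_failed_g, $tag, $after_reset;
```

    Here 0 and undef are different answers from pos: 0 is a real offset, while undef means no position is recorded.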

    (I belabor this point, hoping sincerely that I do not offend anyone by doing so, because there is a rather large lump on my forehead (and an equal-sized dent in the wall) from not carefully reading the boldfaced text!   Incorrectly setting the position to zero, not undef, was a nass-s-s-s-s-s-ty bug that escaped detection for a long time.)