in reply to Re^3: split and capture some of the separators
in thread split and capture some of the separators

I think that it does set undef for non-captured delimiters in a split regex that contains capturing parens. Look at this example:

    my $string = qq/This&and+that/;
    my @segments = split /(&)|\+/, $string;
    print "$_\n" foreach @segments;

    __OUTPUT__
    This
    &
    and
    Use of uninitialized value in concatenation (.) or string at mytest.pl line 4.

    that

In that example, you can see that the non-capturing portion of the match results in undef being plopped into the list element pertaining to that portion of the split.

As for documentation, the POD for split says, "If the PATTERN contains parentheses, additional list elements are created from each matching substring in the delimiter."

This is correct: additional elements are created for each matching substring in the delimiter if the PATTERN contains parentheses. What it doesn't tell you is that those elements are only populated with a value if the portion of the PATTERN that actually matched is inside capturing parens. If the portion of PATTERN that matched isn't captured, the element is still created (since parens were used somewhere else within PATTERN), but it is left as undef.
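To make the placeholder visible without tripping the warning, here is a minimal sketch built on the same example. The variable names @shown and @kept are mine, and grep with defined is just one way to discard the placeholders, not anything the split docs prescribe.

```perl
use strict;
use warnings;

my $string   = 'This&and+that';
my @segments = split /(&)|\+/, $string;

# Show undef explicitly instead of letting it interpolate as ''.
my @shown = map { defined $_ ? "'$_'" : 'undef' } @segments;
print join(', ', @shown), "\n";    # 'This', '&', 'and', undef, 'that'

# Drop the placeholders left by the delimiter that matched without capturing.
my @kept = grep { defined $_ } @segments;
print join(', ', map { "'$_'" } @kept), "\n";    # 'This', '&', 'and', 'that'
```

Note that filtering on defined keeps genuine empty-string fields, which a plain truth test would throw away.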

In this case, I would consider this a bug, either in the documentation (for not documenting what happens if you combine both capturing and noncapturing components in the split PATTERN), or a bug in Perl's split, for not quite accomplishing DWIMery.


Dave

Re^5: split and capture some of the separators
by BrowserUk (Patriarch) on Oct 08, 2004 at 02:35 UTC

    I think that your explanation is closer than mine, but you're not all the way there yet.

        print "'$_'" for split '([&/+-])|\s+', '129-129A & B-131 NORTH AV';

        '129'             ## Matches the first '-', produces '129'
        '-'               ## and the captured delimiter
        '129A'            ## Match the first space, return '129A'
        Use of uninit...  ## and an undef for the empty capture
        ''                ## and a nullstring?
        ''                ## Match the '&', produces another null string?
        '&'               ## and the captured delimiter
        ''
        Use of uninit...  ## Match the second space, produce an undef
        ''                ## and a null string?
        'B'               ## Match the second '-', produce the 'B'
        '-'               ## And the captured delimiter
        '131'             ## Match the 3rd space, produce '131'
        Use of uninit...  ## and undef for the empty capture
        ''                ## and a null string?
        'NORTH'           ## Match the fourth space, produce 'NORTH'
        Use of uninit...  ## and undef for the empty capture
        ''                ## and a null string for luck?
        'AV'              ## And the tail of the string.

    So try throwing away any whitespace around a captured match and it gets better, but still not all the way:

        print "'$_'" for split '\s*([&/+-])\s*|\s+', '129-129A & B-131 NORTH AV';

        '129'             ## Match the first '-', produce '129'
        '-'               ## and the captured delimiter
        '129A'            ## Match ' & ', produce '129A'
        '&'               ## and the captured delimiter
        'B'               ## Match the second '-', produce 'B'
        '-'               ## and the captured delimiter
        '131'             ## Match the first space, produce '131'
        Use of uninit...  ## and undef for the empty delimiter
        ''                ## and a nullstring for luck?
        'NORTH'           ## Match the second space, produce 'NORTH'
        Use of uninit...  ## and undef for the empty capture
        ''                ## and a nullstring for luck?
        'AV'              ## And the tail of the string.

    Which leads me to conclude that split is roughly equivalent to

    @bits = ( $string =~ m[(.*?)(?:PATTERN)]g, $' );

    Viz.

        print "'$_'" for '129-129A & B-131 NORTH AV' =~ m[(.*?)(?:\s*([&/+-])\s*|\s+)]g, $';

        '129'
        '-'
        '129A'
        '&'
        'B'
        '-'
        '131'
        Use of uninitialized value in ...
        ''
        'NORTH'
        Use of uninitialized value in ...
        ''
        'AV'

    Which matches the output from split above exactly.
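    The comparison can also be done without $', which may be easier to follow. What follows is only a sketch of the rough equivalence for this one input, not a description of how split is implemented: the loop and the names @via_split, @via_match and $last are my own, and it ignores the special cases split has (trailing empty fields, a LIMIT argument, zero-width matches).

```perl
use strict;
use warnings;

my $string  = '129-129A & B-131 NORTH AV';
my $pattern = qr{\s*([&/+-])\s*|\s+};

my @via_split = split $pattern, $string;

# Emulation: for every delimiter match, keep the text that precedes it
# and the capture group (undef when the \s+ branch matched), then the tail.
my @via_match;
my $last = 0;
while ($string =~ /$pattern/g) {
    push @via_match, substr($string, $last, $-[0] - $last), $1;
    $last = $+[0];    # end offset of the delimiter just matched
}
push @via_match, substr($string, $last);

my $show = sub { join ' ', map { defined $_ ? "'$_'" : 'undef' } @_ };
print 'split: ', $show->(@via_split), "\n";
print 'match: ', $show->(@via_match), "\n";
# Both lines list: '129' '-' '129A' '&' 'B' '-' '131' undef 'NORTH' undef 'AV'
```

    Printing undef as the word undef, rather than interpolating it, keeps the unmatched captures apart from any real empty strings.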

    But even that does not explain where/why the nullstrings are coming from?

    I think that there are at least two bugs here. The split docs could definitely be bolstered for the captured delimiters case, but also, the mysterious null string captures displayed by the regex above ought to be fixed. Once that is fixed (if it can be), the capturing delimiters case would be easier to explain, I think.
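    One way to dig into the null strings is to label every element of the first split instead of interpolating it. This is a hedged sketch with names of my own (@parts, @kinds); it suggests that two different things print as '': an undef from a capture group that took no part in the match, and a genuine empty field where two delimiters sit side by side (the space and the '&').

```perl
use strict;
use warnings;

my @parts = split m{([&/+-])|\s+}, '129-129A & B-131 NORTH AV';

# Classify each element so undef and '' cannot be confused.
my @kinds = map {
      !defined $_ ? 'undef'
    : $_ eq ''    ? 'empty'
    :               'text'
} @parts;

for my $i (0 .. $#parts) {
    my $value = defined $parts[$i] ? "'$parts[$i]'" : 'undef';
    print "$i: $kinds[$i] $value\n";
}
```

    With this input, the undef entries sit right after a field that ended at a bare space, and the empty fields sit on either side of the '&'.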


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      "Use of uninit..." is only a warning. It doesn't suppress the printing. So it prints the empty value (that's undef casted to a string) too.
Re^5: split and capture some of the separators
by Sandy (Curate) on Oct 08, 2004 at 15:20 UTC
    Curiosity killed the cat (meow)

    I found the null strings, which come from the undefined captures ($2 is not found, and undefined) and I guess $_ gets set to a null string if it is equated to an undefined capture?...

        #!/usr/bin/perl -w
        use strict;

        my $str = '129-129A & B-131 NORTH AV';

        print "\nDon't check for undefined matches\n";
        while ($str =~ m[(.*?)(?:\s*([&/+-])\s*|\s+)]g){
            print "\$1='$1'\t\t\$2='$2'\n";
        }

        print "\nCheck for undefined matches\n";
        while ($str =~ m[(.*?)(?:\s*([&/+-])\s*|\s+)]g) {
            my $one = "undef";
            my $two = "undef";
            $one = $1 if defined $1;
            $two = $2 if defined $2;
            print "\$1='$one'\t\t\$2='$two'\n";
        }

        print "\nPseudo Split\n";
        my @b = ($str =~ m[(.*?)(?:\s*([&/+-])\s*|\s+)]g,$');
        foreach (@b) {
            $_="undef" unless defined $_;
            print "'$_',\n";
        }

    Result (NB: ActiveState does not give me the warning messages, although I get them on Solaris):

        Don't check for undefined matches
        $1='129'        $2='-'
        $1='129A'       $2='&'
        $1='B'          $2='-'
        Use of uninitialized value in concatenation (.) or string at test.pl line 8.
        $1='131'        $2=''
        Use of uninitialized value in concatenation (.) or string at test.pl line 8.
        $1='NORTH'      $2=''

        Check for undefined matches
        $1='129'        $2='-'
        $1='129A'       $2='&'
        $1='B'          $2='-'
        $1='131'        $2='undef'
        $1='NORTH'      $2='undef'

        Pseudo Split
        '129',
        '-',
        '129A',
        '&',
        'B',
        '-',
        '131',
        'undef',
        'NORTH',
        'undef',
        'AV',

    UPDATE: I was actually trying to comment on BrowserUk's post, but slipped up. Oh well.
Re^5: split and capture some of the separators
by ihb (Deacon) on Oct 11, 2004 at 02:17 UTC

    I would consider this a bug, either in the documentation (for not documenting what happens if you combine both capturing and noncapturing components in the split PATTERN)

    From perlfunc 5.8.0:

    As with regular pattern matching, any capturing parentheses that are not matched in a "split()" will be set to "undef" when returned:

        @fields = split /(A)|B/, "1A2B3";
        # @fields is (1, 'A', 2, undef, 3)

    ihb

    Read argumentation in its context!