
but I'm not quite sure where those zero-length tokens were coming from.

It would appear that if you use capture brackets in a split regex, $n is returned regardless of whether the capture brackets are in the part of the regex that actually matched. And when they aren't, $1 gets set to the null string ('') rather than undef, as you (and I) might suppose.

I've never seen this documented but that seems to be the empirical answer.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Re^4: split and capture some of the separators
by davido (Cardinal) on Oct 07, 2004 at 23:40 UTC

    I think that it does set undef for non-captured delimiters in a split regex that contains capturing parens. Look at this example:

        my $string = qq/This&and+that/;
        my @segments = split /(&)|\+/, $string;
        print "$_\n" foreach @segments;
        __OUTPUT__
        This
        &
        and
        Use of uninitialized value in (.) concatenation or string at mytest.pl line 4.
        that

    In that example, you can see that the non-capturing portion of the match results in undef being plopped into the list element pertaining to that portion of the split.
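To tell undef apart from an empty string without relying on the warning, each element can be tested with defined. This is only a minimal sketch that reuses the string and pattern from the example above; the index loop and the $shown variable are my own additions.

```perl
use strict;
use warnings;

my $string   = 'This&and+that';
my @segments = split /(&)|\+/, $string;

# Label each element explicitly so undef and '' cannot be confused.
for my $i (0 .. $#segments) {
    my $shown = defined $segments[$i] ? "'$segments[$i]'" : 'undef';
    print "$i: $shown\n";
}
```

The element produced by the '+' delimiter should be reported as undef, not as ''.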

    As for documentation, the POD for split says, "If the PATTERN contains parentheses, additional list elements are created from each matching substring in the delimiter."

    This is correct. It appears to be true that additional elements are created for each matching substring in the delimiter if the PATTERN contains parentheses. But what it doesn't tell you is that though elements are created for each matching substring, those elements are only populated with a value if the corresponding portion of the PATTERN also uses capturing parens. If the specific portion of PATTERN that matched isn't captured with parens, the element is still created (since parens were used somewhere else within PATTERN), but the element isn't populated.

    In this case, I would consider this a bug, either in the documentation (for not documenting what happens if you combine both capturing and noncapturing components in the split PATTERN), or a bug in Perl's split, for not quite accomplishing DWIMery.


    Dave

      I think that your explanation is closer than mine, but you're not all the way there yet.

          print "'$_'" for split '([&/+-])|\s+', '129-129A & B-131 NORTH AV';
          '129'            ## Matches the first '-', produces '129'
          '-'              ## and the captured delimiter
          '129A'           ## Match the first space, return '129A'
          Use of uninit... ## and an undef for the empty capture
          ''               ## and a nullstring?
          ''               ## Match the '&', produces another null string?
          '&'              ## and the captured delimiter
          ''
          Use of uninit... ## Match the second space, produce an undef
          ''               ## and a null string?
          'B'              ## Match the second '-', produce the 'B'
          '-'              ## And the captured delimiter
          '131'            ## Match the 3rd space, produce '131'
          Use of uninit... ## and undef for the empty capture
          ''               ## and a null string?
          'NORTH'          ## Match the fourth space, produce 'NORTH'
          Use of uninit... ## and undef for the empty capture
          ''               ## and a null string for luck?
          'AV'             ## And the tail of the string.

      So try throwing away any whitespace around a captured match and it gets better, but still not all the way:

          print "'$_'" for split '\s*([&/+-])\s*|\s+', '129-129A & B-131 NORTH AV';
          '129'            ## Match the first '-', produce '129'
          '-'              ## and the captured delimiter
          '129A'           ## Match ' & ', produce '129A'
          '&'              ## and the captured delimiter
          'B'              ## Match the second '-', produce 'B'
          '-'              ## and the captured delimiter
          '131'            ## Match the first space, produce '131'
          Use of uninit... ## and undef for the empty delimiter
          ''               ## and a nullstring for luck?
          'NORTH'          ## Match the second space, produce 'NORTH'
          Use of uninit... ## and undef for the empty capture
          ''               ## and a nullstring for luck?
          'AV'             ## And the tail of the string.

      Which leads me to conclude that split is roughly equivalent to

      @bits = ( $string =~ m[(.*?)(?:PATTERN)]g, $' );

      Viz.

          print "'$_'" for '129-129A & B-131 NORTH AV' =~ m[(.*?)(?:\s*([&/+-])\s*|\s+)]g, $';
          '129'
          '-'
          '129A'
          '&'
          'B'
          '-'
          '131'
          Use of uninitialized value in ...
          ''
          'NORTH'
          Use of uninitialized value in ...
          ''
          'AV'

      Which matches the output from split above exactly.
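The claimed equivalence can be checked side by side rather than by eye. The following is a rough sketch for this one input only, not a statement about how split is implemented; the names $pattern, @from_split, @from_match and $show are mine, and undef is rendered as the bare word undef so that it cannot be mistaken for ''.

```perl
use strict;
use warnings;

my $string  = '129-129A & B-131 NORTH AV';
my $pattern = qr{\s*([&/+-])\s*|\s+};

# Real split, with one capturing group inside the delimiter pattern.
my @from_split = split $pattern, $string;

# The "pseudo split": lazily grab each field, then the delimiter,
# and append whatever follows the last successful match ($').
my @from_match = ( $string =~ m[(.*?)(?:$pattern)]g, $' );

# Render a list so that undef is visibly different from ''.
my $show = sub { join ',', map { defined $_ ? "'$_'" : 'undef' } @_ };

print "split: ", $show->(@from_split), "\n";
print "match: ", $show->(@from_match), "\n";
```

For this string both lines should print the same eleven elements, with undef in the two slots where a plain space was the delimiter.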

      But even that does not explain where/why the nullstrings are coming from?

      I think that there are at least two bugs here. The split docs could definitely be bolstered for the captured delimiters case, but also, the mysterious null string captures displayed by the regex above ought to be fixed. Once that is fixed (if it can be) then the capturing delimiters case would be easier to explain, I think.


        "Use of uninit..." is only a warning. It doesn't suppress the printing. So it prints the empty value (that's undef cast to a string) too.
      Curiosity killed the cat (meow)

      I found the null strings, which come from the undefined captures ($2 is not found, and undefined) and I guess $_ gets set to a null string if it is equated to an undefined capture?...

          #!/usr/bin/perl -w
          use strict;

          my $str = '129-129A & B-131 NORTH AV';

          print "\nDon't check for undefined matches\n";
          while ($str =~ m[(.*?)(?:\s*([&/+-])\s*|\s+)]g){
              print "\$1='$1'\t\t\$2='$2'\n";
          }

          print "\nCheck for undefined matches\n";
          while ($str =~ m[(.*?)(?:\s*([&/+-])\s*|\s+)]g) {
              my $one = "undef";
              my $two = "undef";
              $one = $1 if defined $1;
              $two = $2 if defined $2;
              print "\$1='$one'\t\t\$2='$two'\n";
          }

          print "\nPseudo Split\n";
          my @b = ($str =~ m[(.*?)(?:\s*([&/+-])\s*|\s+)]g,$');
          foreach (@b) {
              $_="undef" unless defined $_;
              print "'$_',\n";
          }
      Result (NB: ActiveState Perl does not give me the warning messages, although I get them on Solaris):
          Don't check for undefined matches
          $1='129'        $2='-'
          $1='129A'       $2='&'
          $1='B'          $2='-'
          Use of uninitialized value in concatenation (.) or string at test.pl line 8.
          $1='131'        $2=''
          Use of uninitialized value in concatenation (.) or string at test.pl line 8.
          $1='NORTH'      $2=''

          Check for undefined matches
          $1='129'        $2='-'
          $1='129A'       $2='&'
          $1='B'          $2='-'
          $1='131'        $2='undef'
          $1='NORTH'      $2='undef'

          Pseudo Split
          '129',
          '-',
          '129A',
          '&',
          'B',
          '-',
          '131',
          'undef',
          'NORTH',
          'undef',
          'AV',
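The '' seen in the unchecked run is simply what undef looks like once it has been interpolated into a string: the value stays undef, the string gets nothing, and a warning is emitted. A small sketch of that behaviour in isolation follows; the __WARN__ handler and the variable names are my own additions, used only to make the warning observable.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $value;                # never assigned, so undef
my $text = "'$value'";    # interpolation contributes an empty string

print "text is $text\n";
print "warnings seen: ", scalar(@warnings), "\n";
print "value is still ", ( defined $value ? 'defined' : 'undef' ), "\n";
```

So the "null strings" are not extra list elements; they are the printed form of the undef elements.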
      UPDATE: I was actually trying to comment on Brower UK's post, but slipped up. Oh well

      I would consider this a bug, either in the documentation (for not documenting what happens if you combine both capturing and noncapturing components in the split PATTERN)

      From perlfunc 5.8.0:

      As with regular pattern matching, any capturing parentheses that are not matched in a "split()" will be set to "undef" when returned:

          @fields = split /(A)|B/, "1A2B3";
          # @fields is (1, 'A', 2, undef, 3)
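If the undef placeholders are not wanted, one option is to filter them out after the split. This is a minimal sketch built on the perlfunc example; the @kept array and the use of grep are my own suggestion, not something the documentation prescribes.

```perl
use strict;
use warnings;

my @fields = split /(A)|B/, '1A2B3';

# Keep captured delimiters and ordinary fields, drop the undef slots
# left by delimiters that matched outside the capturing group.
my @kept = grep { defined } @fields;

print scalar(@fields), " fields before filtering\n";
print "kept: @kept\n";
```

Note that this keeps genuinely empty fields (''), which are defined; only the unmatched-capture placeholders are removed.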

      ihb

      Read argumentation in its context!