Re: split and capture some of the separators

A few (more) examples of what you're trying to parse would help, but this seems to do what I think you're trying to do?

my @s = ( 
    '129-129A & B-131 NORTH AV', 
    '129-129A + B-131 NORTH AV', 
    '129-129A / B-131 NORTH AV', 
    '129-129A - B-131 NORTH AV',
);
print join('|', grep $_, split '(?:\s+([&/+-])\s+)|\s+', $_), "\n" for @s;

129-129A|&|B-131|NORTH|AV
129-129A|+|B-131|NORTH|AV
129-129A|/|B-131|NORTH|AV
129-129A|-|B-131|NORTH|AV

Update after your update: Maybe this is nearer?

perl> print join'|', grep $_, split '([&/+-])|\s+', $_ for @s;
129|-|129A|&|B|-|131|NORTH|AV
129|-|129A|+|B|-|131|NORTH|AV
129|-|129A|/|B|-|131|NORTH|AV
129|-|129A|-|B|-|131|NORTH|AV

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Re^2: split and capture some of the separators by ikegami (Patriarch) on Oct 07, 2004 at 21:51 UTC
`grep $_,` will drop some tokens, such as `0`. Better to use `grep length($_),` or `grep { defined($_) && length($_) }`.
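To make the point above concrete, here is a minimal sketch. The address string is a made-up example (not from the original data), chosen so that one token is the string `0`; the pattern is the one from the first post.

```perl
use strict;
use warnings;

# Hypothetical input: the first token is the string "0".
my $addr = '0 - 12 NORTH AV';

# Same pattern as in the first post, written with m{} so '/' needs no escaping.
my @raw = split m{(?:\s+([&/+-])\s+)|\s+}, $addr;

# Truthiness filter: drops undef and '', but also the valid token "0".
my @by_truth = grep { $_ } @raw;

# Definedness plus length: drops undef and '', keeps "0".
my @by_length = grep { defined && length } @raw;

print join('|', @by_truth),  "\n";   # -|12|NORTH|AV
print join('|', @by_length), "\n";   # 0|-|12|NORTH|AV
```

The `defined && length` form costs a few more characters but does not depend on how a given Perl version treats `length(undef)`.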
Re^3: split and capture some of the separators by BrowserUk (Patriarch) on Oct 07, 2004 at 22:29 UTC
'Tis a good point in the general case. What about just `grep length, split...`?
Re^4: split and capture some of the separators by shemp (Deacon) on Oct 07, 2004 at 22:33 UTC
If $_ is undef, you get the warning: Use of uninitialized value...
Re^5: split and capture some of the separators by BrowserUk (Patriarch) on Oct 08, 2004 at 01:16 UTC
Re^2: split and capture some of the separators by shemp (Deacon) on Oct 07, 2004 at 21:47 UTC
OK, thanks, I figured it out (sort of). If I just use `grep $_, ...` I get what I want. The `defined($_)` wasn't catching some zero-length tokens, but I'm not quite sure where those zero-length tokens were coming from.

This is for a general address parser for government-supplied tax data, which has all sorts of strange formatting in it. I'm eventually turning these strings into a couple of DB tables so that they can be searched consistently. The part I'm asking about is tokenizing; other stuff will then identify the different parts of the addresses, send them through address-correction software, etc.
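One possible source of those zero-length tokens, sketched under the assumption that the pattern in use was the one from the updated post above (`([&/+-])|\s+`): when two separators sit next to each other, such as a space followed by `&`, split returns the empty string that lies between them. That empty string is defined, so a `defined` filter lets it through. The input string here is a shortened, made-up example.

```perl
use strict;
use warnings;

# Hypothetical short input with a space on each side of the '&'.
my $addr = '129A & B';

# Assumed pattern: each single separator character or whitespace run is a delimiter.
my @raw = split m{([&/+-])|\s+}, $addr;

# Render undef and '' visibly instead of printing them directly.
my @shown = map { defined $_ ? "'$_'" : 'undef' } @raw;
print join(', ', @shown), "\n";
# '129A', undef, '', '&', '', undef, 'B'

# defined alone keeps the two empty strings ...
my @defined_only = grep { defined } @raw;

# ... while defined plus length removes them.
my @tokens = grep { defined && length } @raw;

print scalar(@defined_only), "\n";   # 5
print join('|', @tokens), "\n";      # 129A|&|B
```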
Re^3: split and capture some of the separators by BrowserUk (Patriarch) on Oct 07, 2004 at 22:34 UTC
"but I'm not quite sure where those zero-length tokens were coming from."

It would appear that if you use capture brackets in a split regex, $n is returned regardless of whether the capture brackets are in the part of the regex that actually matched. And when they aren't, $1 gets set to the null string ('') rather than undef as you (and I) might suppose.

I've never seen this documented, but that seems to be the empirical answer.
Re^4: split and capture some of the separators by davido (Cardinal) on Oct 07, 2004 at 23:40 UTC
I think that it does set undef for non-captured delimiters in a split regex that contains capturing parens. Look at this example:

my $string = qq/This&and+that/;
my @segments = split /(&)|\+/, $string;
print "$_\n" foreach @segments;

__OUTPUT__
This
&
and
Use of uninitialized value in concatenation (.) or string at mytest.pl line 4.
that

In that example, you can see that the non-capturing portion of the match results in undef being plopped into the list element pertaining to that portion of the split.

As for documentation, the POD for split says, "If the PATTERN contains parentheses, additional list elements are created from each matching substring in the delimiter." This is correct: additional elements are created for each matching substring in the delimiter if the PATTERN contains parentheses. What it doesn't tell you is that, although elements are created for each matching substring, those elements are only populated with a value if the corresponding portion of the PATTERN also uses capturing parens. If the specific portion of PATTERN that matched isn't captured with parens, the element is still created (since parens were used somewhere else within PATTERN), but the element isn't populated.

In this case, I would consider this a bug, either in the documentation (for not documenting what happens if you combine both capturing and noncapturing components in the split PATTERN), or in Perl's split, for not quite accomplishing DWIMery.

Dave
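The behaviour described above can be checked without triggering the warning. The sketch below reuses the same string and a pattern with only the `&` alternative captured, prints each element with undef made visible, and then filters. For what it is worth, the current perldoc entry for split does describe this case: a group that does not take part in the match yields undef.

```perl
use strict;
use warnings;

my $string = 'This&and+that';

# Only the '&' alternative is inside capturing parentheses.
my @segments = split /(&)|\+/, $string;

# Show undef explicitly rather than interpolating it (which would warn).
my @shown = map { defined $_ ? "'$_'" : 'undef' } @segments;
print join(', ', @shown), "\n";
# 'This', '&', 'and', undef, 'that'

# Keep only defined, non-empty tokens.
my @tokens = grep { defined && length } @segments;
print join('|', @tokens), "\n";
# This|&|and|that
```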
Re^5: split and capture some of the separators by BrowserUk (Patriarch) on Oct 08, 2004 at 02:35 UTC
Re^6: split and capture some of the separators by Anonymous Monk on Oct 08, 2004 at 17:59 UTC
Re^5: split and capture some of the separators by Sandy (Curate) on Oct 08, 2004 at 15:20 UTC
Re^5: split and capture some of the separators by ihb (Deacon) on Oct 11, 2004 at 02:17 UTC