split and capture some of the separators

shemp has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
•Re: split and capture some of the separators by merlyn (Sage) on Oct 07, 2004 at 21:44 UTC
I have almost always found that a "capture split" is better replaced by a proper `m//g` instead. Have you considered that? -- Randal L. Schwartz, Perl hacker
Re^2: split and capture some of the separators by shemp (Deacon) on Oct 07, 2004 at 21:58 UTC
I did think about that, but I can't quite see how to do it in this case. Part of the problem for me is that the separators are well-defined, but what lies between them could be anything (except a separator).
Re^3: split and capture some of the separators by bart (Canon) on Oct 07, 2004 at 22:18 UTC
You can use the following "continue to split" mechanism in a loop: `(my($token, $sep), $string) = split /PATTERN/, $string, 2;` This loads the matched string (the separator) into `$sep`, the stuff before it into `$token`, and the rest of the string right after the match back into `$string`, shortening it, ready for the next iteration, provided you have exactly one pair of capturing parens in the pattern. It's almost identical in effect to using the special variables `` $` ``, `$&`, `$'` on a normal match with the same pattern, but without the negative impact those variables have on the global speed of regexes. If you could have more capturing parens, you can do: `my($token, @sep) = split /PATTERN/, $string, 2; $string = pop @sep;` leaving all the captured separators in `@sep`.
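A minimal sketch of that loop, for illustration only. The sample address is borrowed from BrowserUk's reply, and the separator pattern `\s*([&\/+-])\s*` is an assumption standing in for `PATTERN`; it has exactly one pair of capturing parens, as the technique requires.

```perl
use strict;
use warnings;

# Sample input borrowed from elsewhere in the thread.
my $string = '129-129A & B-131 NORTH AV';

my @tokens;
while (defined $string && length $string) {
    # Limit 2: split at the first separator only, and put the remainder
    # back into $string. Assumed pattern; one capture group.
    (my ($token, $sep), $string) = split /\s*([&\/+-])\s*/, $string, 2;

    push @tokens, $token if defined $token && length $token;
    push @tokens, $sep   if defined $sep;   # undef once no separator is left
}

print join('|', @tokens), "\n";
```

When no separator is left, `split` returns a single field, so `$sep` and `$string` both become undef and the loop ends. Whitespace is not a separator in this assumed pattern, so `131 NORTH AV` stays together as the final token.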
Re^3: split and capture some of the separators by Jasper (Chaplain) on Oct 08, 2004 at 09:14 UTC
"Part of the problem for me is that the separators are well-defined, but what between them could be anything (except a separator)." This looks (to me) like that sentence written in Perl: `@list = /([SEPARATORS]+)([^SEPARATORS]*)/g;` You could include `\s` in the second character class if you wanted to ignore whitespace. Of course, I've been wrong in the past :)
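One way the `m//g` idea might look on a concrete string. This is a sketch, not anyone's posted solution: the separator set `& / + -` is assumed from the sample data elsewhere in the thread, and whitespace is treated as something to skip rather than capture.

```perl
use strict;
use warnings;

my $string = '129-129A & B-131 NORTH AV';

# Each match is either one separator character, or a run of characters
# that are neither separators nor whitespace. Whitespace matches neither
# alternative, so m//g simply steps over it.
my @tokens = $string =~ m{ ( [&/+-] | [^\s&/+-]+ ) }gx;

print join('|', @tokens), "\n";
```

Because every match is at least one character long, no empty or undefined elements are produced, so no `grep` is needed afterwards and a token such as `0` would survive.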
Re: split and capture some of the separators by BrowserUk (Patriarch) on Oct 07, 2004 at 21:33 UTC
A few (more) examples of what you're trying to parse would help, but this seems to do what I think you're trying to do:
`my @s = ( '129-129A & B-131 NORTH AV', '129-129A + B-131 NORTH AV', '129-129A / B-131 NORTH AV', '129-129A - B-131 NORTH AV', );`
`print join '|', grep $_, split '(?:\s+([&/+-])\s+)|\s+', $_ for @s;`
Output:
`129-129A|&|B-131|NORTH|AV`
`129-129A|+|B-131|NORTH|AV`
`129-129A|/|B-131|NORTH|AV`
`129-129A|-|B-131|NORTH|AV`
Update after your update: Maybe this is nearer:
`perl> print join '|', grep $_, split '([&/+-])|\s+', $_ for @s;`
`129|-|129A|&|B|-|131|NORTH|AV`
`129|-|129A|+|B|-|131|NORTH|AV`
`129|-|129A|/|B|-|131|NORTH|AV`
`129|-|129A|-|B|-|131|NORTH|AV`
Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re^2: split and capture some of the separators by ikegami (Patriarch) on Oct 07, 2004 at 21:51 UTC
`grep $_,` will erase some tokens such as `0`. Better to use `grep length($_),` or `grep { defined($_) && length($_) }`.
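A small illustration of the difference, using a made-up field list (not data from the thread) that contains a `0`, an empty string, and an undef:

```perl
use strict;
use warnings;

# Hypothetical split output: real tokens mixed with '' and undef.
my @fields = ('129', '0', '', undef, 'AV');

# Truthiness test: drops '' and undef, but also the valid token '0'.
my @truthy = grep { $_ } @fields;

# Defined-and-non-empty test: keeps '0', drops only '' and undef.
my @nonempty = grep { defined($_) && length($_) } @fields;

print "truthy:   @truthy\n";     # 129 AV
print "nonempty: @nonempty\n";   # 129 0 AV
```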
Re^3: split and capture some of the separators by BrowserUk (Patriarch) on Oct 07, 2004 at 22:29 UTC
'Tis a good point in the general case. What about just `grep length, split...`?
Re^4: split and capture some of the separators by shemp (Deacon) on Oct 07, 2004 at 22:33 UTC
Re^5: split and capture some of the separators by BrowserUk (Patriarch) on Oct 08, 2004 at 01:16 UTC
Re^2: split and capture some of the separators by shemp (Deacon) on Oct 07, 2004 at 21:47 UTC
OK, thanks, I figured it out (sort of). If I just use `grep $_, ...` I get what I want. The `defined($_)` wasn't catching some zero-length tokens, but I'm not quite sure where those zero-length tokens were coming from. This is for a general address parser for government-supplied tax data, which has all sorts of strange formatting in it. I'm turning these strings into a couple of DB tables eventually, so that they can be consistently searched. The part I'm asking about is tokenizing; other stuff will then identify the different parts of the addresses, send them through address-correction software, etc.
Re^3: split and capture some of the separators by BrowserUk (Patriarch) on Oct 07, 2004 at 22:34 UTC
"but I'm not quite sure where those zero-length tokens were coming from." It would appear that if you use capture brackets in a split regex, `$n` is returned regardless of whether the capture brackets are in the part of the regex that actually matched. And when they aren't, `$1` gets set to the null string (`''`) rather than undef as you (and I) might suppose. I've never seen this documented, but that seems to be the empirical answer.
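For comparison, `perldoc -f split` in current Perl releases says that a capturing group which takes no part in a match contributes undef, not an empty string, to the result. The check below is a sketch for a recent Perl only; how the 2004-era Perl used in this thread behaved is not verified here.

```perl
use strict;
use warnings;

# Two capture groups in an alternation: only one of them can take part
# in any given separator match.
my @fields = split /(-)|(,)/, '1-10,20';

# Render undef visibly so it can be told apart from ''.
my @shown = map { defined $_ ? "'$_'" : 'undef' } @fields;
print join(', ', @shown), "\n";
# '1', '-', undef, '10', undef, ',', '20'
```

So on a recent Perl there are two distinct kinds of unwanted element: undef from groups that did not match, and genuinely empty strings, which would explain why a `defined` test alone was not enough.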
Re^4: split and capture some of the separators by davido (Cardinal) on Oct 07, 2004 at 23:40 UTC
Re^5: split and capture some of the separators by BrowserUk (Patriarch) on Oct 08, 2004 at 02:35 UTC
Some notes below your chosen depth have not been shown here
Re^5: split and capture some of the separators by Sandy (Curate) on Oct 08, 2004 at 15:20 UTC
Re^5: split and capture some of the separators by ihb (Deacon) on Oct 11, 2004 at 02:17 UTC
Re: split and capture some of the separators by ikegami (Patriarch) on Oct 07, 2004 at 21:30 UTC
`grep defined($_), split` doesn't work for me. `grep length($_), split` does.
Re: split and capture some of the separators by ihb (Deacon) on Oct 11, 2004 at 02:34 UTC
The reason you're getting the zero-length elements is that you have two delimiters that follow each other. In between those delimiters there's nothing, and therefore you get a string holding nothing. Example:
`$_ = 'XABX';`
`print "<$_>" for split /A|B/;`
`__END__`
`<X>`
`<>`
`<X>`
Step by step, it goes a little like this:
`'X' . 'A' . 'BX' # 'A' matched.`
`'BX' # 'A' removed, 'X' returned.`
`'' . 'B' . 'X' # 'B' matched.`
`'X' # 'B' removed, '' returned.`
ihb
Read argumentation in its context!
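The same effect can be seen on one of the address strings from earlier in the thread. This sketch assumes the second pattern BrowserUk posted, `([&/+-])|\s+`, and a recent Perl: in ` & ` the space, the ampersand, and the following space are three adjacent delimiters, so empty strings appear between them.

```perl
use strict;
use warnings;

my $address = '129-129A & B-131 NORTH AV';

# Separator characters are captured; whitespace is matched but not captured.
my @raw = split /([&\/+-])|\s+/, $address;

# Show the unfiltered result, making undef and '' visible.
print join(', ', map { defined $_ ? "'$_'" : 'undef' } @raw), "\n";

# Keep only real tokens; unlike "grep $_", this would not drop a '0'.
my @tokens = grep { defined($_) && length($_) } @raw;
print join('|', @tokens), "\n";   # 129|-|129A|&|B|-|131|NORTH|AV
```

The filter tests both `defined` and `length` because the raw list holds both kinds of unwanted element.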