Splitting multiline string into words, the stuff between words, and newlines

ibm1620 has asked for the wisdom of the Perl Monks concerning the following question:

I want to split up an ASCII text document into words (as recognized by `\b{wb}`), strings of the non-word characters between words, and strings of newlines.

The following code almost works, but instead of treating the newlines as separate tokens, it leaves them attached to the neighboring run of non-word characters:

#!/usr/bin/env perl
use strict;
use warnings;

my $book = do {local $/; <DATA>}; # slurp the book

# Split book into words (delimited by \b{wb}), sequences of newlines,
# and sequences of anything else.

while ($book =~
       /(
            ( \W+ )
        |
            ( \b{wb}.+?\b{wb} )
        |
            ( \n+ )
        )
       /xg)    {
    show($1);
}
print "\n";

# show(): make spaces and newlines visible
sub show {
    my $str = shift;
    $str =~ tr/\n/$/;
    $str =~ tr/ /_/;
    print "{$str}\n";
}
__DATA__
--First paragraph--
Second one's followed by only one newline. "Hello," she said, "How's tricks?"

Third paragraph doesn't end with any punctuation ... and the splitting works

4th one is separated by two newlines.

         The End.

The output is:

{--}
{First}
{_}
{paragraph}
{--$}                <- The newline ('$') should be separate group
{Second}
{_}
{one's}
{_}
{followed}
{_}
{by}
{_}
{only}
{_}
{one}
{_}
{newline}
{._"}
{Hello}
{,"_}
{she}
{_}
{said}
{,_"}
{How's}
{_}
{tricks}
{?"$$}            <- the two newlines should be a separate group
{Third}
{_}
{paragraph}
{_}
{doesn't}
{_}
{end}
{_}
{with}
{_}
{any}
{_}
{punctuation}
{_..._}
{and}
{_}
{the}
{_}
{splitting}
{_}
{works}              <- Correctly
{$$}                   <- split
{4th}
{_}
{one}
{_}
{is}
{_}
{separated}
{_}
{by}
{_}
{two}
{_}
{newlines}
{.$$_________}       <- should be three separate groups
{The}
{_}
{End}
{.$}

I'm wondering what I'm doing wrong, and whether there's a better solution. (Would the split function be preferable?)


Re: Splitting multiline string into words, the stuff between words, and newlines by LanX (Saint) on Feb 24, 2022 at 00:55 UTC
> I'm wondering what I'm doing wrong,

`\W` is the negation of `\w`, but it still matches `\n`, so the first alternative swallows the newlines. I negated both with `[^\w\n]`.

Is that the result you wanted?

I can't really comment on the rest; it looks weird to me.

Cheers Rolf
(addicted to the Perl Programming Language :)
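A minimal sketch of the fix described above, assuming Perl 5.22 or later (needed for `\b{wb}`). The only change to the original pattern is `[^\w\n]+` in place of `\W+`; the `tokenize` helper name and the sample string are made up for illustration and are not from the thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.22;    # \b{wb} needs Perl 5.22 or later

# tokenize() is a hypothetical helper name, not from the original post.
# It returns words, runs of non-word characters, and runs of newlines.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /( [^\w\n]+              # non-word run, newlines excluded
                     | \b{wb} .+? \b{wb}     # one word as delimited by \b{wb}
                     | \n+                   # run of newlines
                     )/xg) {
        push @tokens, $1;
    }
    return @tokens;
}

# Illustrative input only; newlines are shown as '$' and spaces as '_'.
for my $tok (tokenize("one's end--\n\nThe  End.\n")) {
    (my $shown = $tok) =~ tr/\n /$_/;
    print "{$shown}\n";
}
```

Because the first alternative can no longer consume `\n`, and `.` does not match a newline without `/s`, every run of newlines falls through to the `\n+` alternative and comes out as its own token.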
Re^2: Splitting multiline string into words, the stuff between words, and newlines by ibm1620 (Hermit) on Feb 24, 2022 at 01:21 UTC
Yes, that's exactly what I wanted. Thank you. (I'm just playing with different ways of building Markov chains, a la dissociated-press.)
Re: Splitting multiline string into words, the stuff between words, and newlines by salva (Canon) on Feb 24, 2022 at 09:06 UTC
You can also use split for that, so that you don't need a regular expression for matching the non-words: `my @fragments = grep length, split /(\b{wb}.+?\b{wb}|\n+)/, $book;` That way you get words, sequences of newlines, and then everything else.
Re^2: Splitting multiline string into words, the stuff between words, and newlines by ibm1620 (Hermit) on Feb 24, 2022 at 12:50 UTC
This looks to me like it should work, but it splits the strings of non-word characters into separate characters! `"For example ...\n" -> {For}{_}{example}{_}{.}{.}{.}{$}`
Re^3: Splitting multiline string into words, the stuff between words, and newlines by salva (Canon) on Feb 25, 2022 at 09:22 UTC
That is because `\b{wb}` also matches between those punctuation characters. This seems to solve the issue: `my @fragments = grep length, split /(\b{wb}\w.*?\b{wb}|\n+)/, $book;` But my knowledge of Unicode and the `\b{wb}` semantics is rather limited, so that may have other issues.
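A small runnable sketch of this split-based variant, again assuming Perl 5.22 or later. The `fragments` helper name is hypothetical, and the input is the sample string from the earlier reply. Requiring `\w` right after the first `\b{wb}` means only words (and newline runs) act as separators, so a run of punctuation and spaces stays together as one field.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.22;    # \b{wb} needs Perl 5.22 or later

# fragments() is a hypothetical helper name for illustration.
sub fragments {
    my ($text) = @_;
    # The capturing group makes split return the separators (words and
    # newline runs) along with the text between them; grep drops the
    # empty fields that split produces between adjacent separators.
    return grep length, split /(\b{wb}\w.*?\b{wb}|\n+)/, $text;
}

# Newlines are shown as '$' and spaces as '_'.
for my $frag (fragments("For example ...\n")) {
    (my $shown = $frag) =~ tr/\n /$_/;
    print "{$shown}\n";
}
```

As the reply itself cautions, this has only been reasoned through for plain ASCII text; other inputs may hit `\b{wb}` cases not considered here.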
Re^4: Splitting multiline string into words, the stuff between words, and newlines by LanX (Saint) on Feb 25, 2022 at 10:22 UTC
Re^5: Splitting multiline string into words, the stuff between words, and newlines by salva (Canon) on Feb 25, 2022 at 11:00 UTC
Re^4: Splitting multiline string into words, the stuff between words, and newlines by ibm1620 (Hermit) on Feb 26, 2022 at 21:00 UTC

