
I somewhat agree that the example and explanation ("whole string", "language processing") in the tutorial are not well chosen.

An example showing how to split ...

... into individual words might be better.

That's actually still not that easy: Perl complying with the Unicode standard doesn't by itself make it trivial.

Edit
The best I could come up with is to split on word boundaries and then discard punctuation and whitespace with a grep.

  DB<13> $str = "I don't think 'don't' isn't a word"
  DB<14> x split /\b{wb}/, $str
0  'I'
1  ' '
2  'don\'t'
3  ' '
4  'think'
5  ' '
6  '\''
7  'don\'t'
8  '\''
9  ' '
10  'isn\'t'
11  ' '
12  'a'
13  ' '
14  'word'

Update

For example, with $str expanded to cover more edge cases:

  DB<27> $str = "I don't think, 'don't' isn't a word..."
  DB<28> x @list= split /\b{wb}/, $str
0  'I'
1  ' '
2  'don\'t'
3  ' '
4  'think'
5  ','
6  ' '
7  '\''
8  'don\'t'
9  '\''
10  ' '
11  'isn\'t'
12  ' '
13  'a'
14  ' '
15  'word'
16  '.'
17  '.'
18  '.'
  DB<29> x grep { not /^\W|\s+$/ } @list
0  'I'
1  'don\'t'
2  'think'
3  'don\'t'
4  'isn\'t'
5  'a'
6  'word'

Update

FWIW, grep { ! /^\W+$/ } yields the same result, but I'm not convinced the example already covers all edge cases...
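
For anyone who wants to try this outside the debugger, here is a minimal sketch of the same split-plus-grep approach as a standalone script. The helper name words() is only my illustration, not anything from the tutorial, and \b{wb} needs Perl 5.22 or later.

```perl
use v5.22;      # \b{wb} was introduced in Perl 5.22
use warnings;

# Hypothetical helper: split on Unicode word boundaries, then drop
# every chunk that consists only of non-word characters
# (whitespace and punctuation).
sub words {
    my ($str) = @_;
    return grep { !/^\W+$/ } split /\b{wb}/, $str;
}

my $str = "I don't think, 'don't' isn't a word...";
say join '|', words($str);    # I|don't|think|don't|isn't|a|word
```

Joining with '|' just makes the token borders visible; the list itself is what the grep in the session above returns.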

Update 2025-04-13

FWIW

"Francis' car" is an example of what would still fail: after splitting, the apostrophe is not part of the first word. Admittedly a tough problem.
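
A small sketch to reproduce that failure, assuming Perl 5.22 or later; the variable name @chunks is just for illustration:

```perl
use v5.22;      # \b{wb} was introduced in Perl 5.22
use warnings;

# The trailing apostrophe is followed by a space, not a letter,
# so the word boundary falls between "Francis" and "'".
my @chunks = split /\b{wb}/, "Francis' car";
say join '|', map { "[$_]" } @chunks;       # [Francis]['][ ][car]

# The same filter as before then throws the apostrophe away.
say join '|', grep { !/^\W+$/ } @chunks;    # Francis|car
```

So the possessive comes out as plain "Francis", and nothing in the split result tells you the apostrophe belonged to it.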

Cheers Rolf
(addicted to the Perl Programming Language :)
see Wikisyntax for the Monastery