in reply to perlretut - Perl regular expressions tutorial curveball

\b is the standard word boundary: it matches at any transition from \w to \W or vice versa (the start and end of the string count as \W).

But a single quote "'" is not an alphanumeric character²; it does not belong to the class \w but to the opposite class \W!

Hence in the example only "don" would be matched. ¹

(Actually, to be more precise "don't" should be written with an apostrophe not a quote, but yeah computers you know ;)

Perl 5.22 introduced \b{wb} to "process" characters which appear inside words of "natural languages" like English according to Unicode rules.

With \b{wb}, "don't" can be matched as a single word.
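
A minimal sketch of the difference, assuming Perl 5.22 or later (the variable names are made up for illustration). Both patterns capture as little as possible up to the first boundary after the start of the string:

```perl
use strict;
use warnings;
use 5.022;    # \b{wb} needs Perl 5.22 or later

my $text = "don't";

# Plain \b: the single quote is \W, so a boundary sits between "n" and "'".
my ($plain) = $text =~ /^(.+?)\b/;

# \b{wb}: Unicode word-boundary rules do not break at a single quote
# that has letters on both sides, so the first boundary is at the end.
my ($unicode) = $text =~ /^(.+?)\b{wb}/;

print "plain \\b : $plain\n";      # don
print "\\b{wb}   : $unicode\n";    # don't
```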

Hope it's clearer now.

How to find documentation

perlretut is a tutorial; you need to look up details in perlre, which is the central reference. After searching for "wb" there, you'll be referred to perlrebackslash, which details backslash sequences.

In this case a discussion is found in \b{}, \b, \B{}, \B

With explicit details in

> To get better word matching of natural language text, see "\b{wb}" below.

Cheers Rolf
(addicted to the Perl Programming Language :)
see Wikisyntax for the Monastery

Updates

¹)

  DB<35> x "don't" =~ / (.+?) (\b) /x
0  'don'
1  ''
  DB<36> x "don't" =~ / (.+?) (\b) /xg
0  'don'
1  ''          # boundaries always empty
2  '\''
3  ''
4  't'
5  ''

²) actually this is also more complicated ...

\w [3] Match a "word" character (alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks)
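
To make that definition concrete, here is a small sketch that tests a few single characters against \w. The sample list and labels are my own choice for illustration, not taken from perlre:

```perl
use strict;
use warnings;

# Candidate characters, each with a readable label (labels are illustrative).
my @samples = (
    [ 'letter a'               => "a" ],
    [ 'digit 7'                => "7" ],
    [ 'underscore'             => "_" ],
    [ 'single quote U+0027'    => "'" ],
    [ 'apostrophe U+2019'      => "\x{2019}" ],
    [ 'combining acute U+0301' => "\x{0301}" ],    # a Unicode mark
    [ 'undertie U+203F'        => "\x{203F}" ],    # connector punctuation
);

my %is_word;
for my $sample (@samples) {
    my ($label, $char) = @$sample;
    # True when the single character is in the \w class.
    $is_word{$label} = $char =~ /^\w$/ ? 1 : 0;
    printf "%-24s %s\n", $label, $is_word{$label} ? '\w' : '\W';
}
```

Neither the ASCII single quote nor the typographic apostrophe is in \w, which is why plain \b sees a boundary inside "don't" either way.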

Re^2: perlretut - Perl regular expressions tutorial curveball
by afoken (Chancellor) on Apr 11, 2025 at 08:43 UTC

    How to find documentation

    perlretut is a tutorial; you need to look up details in perlre, which is the central reference.

    I just thought: this should be right at the top of perlretut, and a link to perlretut should be right at the top of perlre! I was quite sure it was not, so I quickly looked at the two documents before posting nonsense. And to my surprise, these two links are already there; each is the very first link under "description". Great job!

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^2: perlretut - Perl regular expressions tutorial curveball
by LanX (Saint) on Apr 11, 2025 at 12:49 UTC
    I somewhat agree that the example and explanation ("whole string", "language processing") in the tutorial are not well chosen.

    An example showing how to split ...

    • "I don't think 'don't' isn't a word"
    ... into individual words might be better.

    That's actually still not that easy; Perl complying with Unicode standards doesn't by itself make it trivial.

    Edit
    The best I could come up with is to split on word boundaries and to discard punctuation and whitespace in a grep.

      DB<13> $str = "I don't think 'don't' isn't a word"
      DB<14> x split /\b{wb}/, $str
    0  'I'
    1  ' '
    2  'don\'t'
    3  ' '
    4  'think'
    5  ' '
    6  '\''
    7  'don\'t'
    8  '\''
    9  ' '
    10  'isn\'t'
    11  ' '
    12  'a'
    13  ' '
    14  'word'

    Update

    Like... ($str expanded with more edge cases)

      DB<27> $str = "I don't think, 'don't' isn't a word..."
      DB<28> x @list= split /\b{wb}/, $str
    0  'I'
    1  ' '
    2  'don\'t'
    3  ' '
    4  'think'
    5  ','
    6  ' '
    7  '\''
    8  'don\'t'
    9  '\''
    10  ' '
    11  'isn\'t'
    12  ' '
    13  'a'
    14  ' '
    15  'word'
    16  '.'
    17  '.'
    18  '.'
      DB<29> x grep { not /^\W|\s+$/ } @list
    0  'I'
    1  'don\'t'
    2  'think'
    3  'don\'t'
    4  'isn\'t'
    5  'a'
    6  'word'

    Update

    FWIW grep { ! /^\W+$/ } yields the same result, but I'm not convinced the example already covers all edge cases...

    Update 2025-04-13

    FWIW

    "Francis' car" is an example of what would still fail. After splitting, the trailing apostrophe will not be part of the first word. Admittedly a tough problem.

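The split-and-grep approach from this reply can be wrapped up as a standalone script. This is only a sketch, assuming Perl 5.22 or later; the helper name words() is hypothetical, and the filter is the grep { ! /^\W+$/ } variant mentioned in the update above:

```perl
use strict;
use warnings;
use 5.022;    # \b{wb} needs Perl 5.22 or later

# Hypothetical helper: split on Unicode word boundaries, then drop every
# piece that consists only of non-word characters (punctuation, spaces).
sub words {
    my ($str) = @_;
    return grep { !/^\W+$/ } split /\b{wb}/, $str;
}

my @ok   = words("I don't think, 'don't' isn't a word...");
my @fail = words("Francis' car");

print join('|', @ok),   "\n";    # I|don't|think|don't|isn't|a|word
print join('|', @fail), "\n";    # Francis|car  (the possessive apostrophe is lost)
```

The second call shows the remaining weakness: an apostrophe with no letter after it is treated as a separate piece and then filtered out as punctuation.
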
    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery