Re: perlretut - Perl regular expressions tutorial curveball

\b is the standard word boundary: it matches at any transition from \w to \W or vice versa.

But a single quote "'" is not an alphanumeric character² in the class \w; it belongs to the opposite class \W!

Hence in the example only "don" would be matched.¹
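
A minimal sketch of that point, using only core Perl (the variable names are mine, not from the tutorial). It checks which class the single quote falls into and where the first plain \b after the start of "don't" lands:

```perl
use strict;
use warnings;

my $word = "don't";

# The single quote is not a \w character, so it falls into \W.
my $quote_is_word_char = ("'" =~ /\w/) ? 1 : 0;

# Shortest non-empty prefix that ends at a plain \b word boundary.
my ($prefix) = $word =~ /^(.+?)\b/;

print "quote is \\w: $quote_is_word_char\n";
print "prefix up to first \\b: $prefix\n";
```

The prefix stops at "don" because the n-to-quote transition is a \w to \W change.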

(Actually, to be more precise "don't" should be written with an apostrophe not a quote, but yeah computers you know ;)

Perl 5.22 introduced \b{wb}, which handles characters that appear inside words of "natural languages" like English according to Unicode rules.

Now "don't" can be matched.
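
As a hedged sketch of the difference (it assumes a perl of at least 5.22; the variable names are illustrative), the same kind of pattern with \b{wb} no longer stops at the quote:

```perl
use strict;
use warnings;
use v5.22;    # \b{wb} is only available from Perl 5.22 on

my $word = "don't";

# Shortest non-empty prefix that ends at a Unicode word boundary.
my ($prefix) = $word =~ /^(.+?)\b{wb}/;

print "prefix up to first \\b{wb}: $prefix\n";
```

Here the first boundary after the start is the end of the string, so the capture is the whole of "don't".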

Hope it's clearer now.

How to find documentation

perlretut is a tutorial; you need to look up details in perlre, which is the central reference. After searching for "wb" there you'll be referred to perlrebackslash, which details backslash sequences.

In this case the discussion is found under "\b{}, \b, \B{}, \B", which points to the explicit details:

> To get better word matching of natural language text, see "\b{wb}" below.

Cheers Rolf
(addicted to the Perl Programming Language :)
see Wikisyntax for the Monastery

Updates

¹)

  DB<35> x "don't" =~ / (.+?) (\b) /x
0  'don'
1  ''
  DB<36> x "don't" =~ / (.+?) (\b) /xg
0  'don'
1  ''            # boundaries always empty
2  '\''
3  ''
4  't'
5  ''

²) actually this is also more complicated ...

\w        [3]  Match a "word" character (alphanumeric plus "_", plus
                  other connector punctuation chars plus Unicode
                  marks)
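
To make the perlre wording above concrete, here is a small sketch that probes a few characters against \w. The probe list and labels are my own choice, not taken from the documentation; U+203F (undertie) stands in for "other connector punctuation" and U+0301 (combining acute accent) for "Unicode marks":

```perl
use strict;
use warnings;

# Characters to probe; the labels are only used for the printout.
my @probes = (
    [ "letter a",               "a" ],
    [ "digit 7",                "7" ],
    [ "underscore",             "_" ],
    [ "undertie U+203F",        "\x{203F}" ],
    [ "combining acute U+0301", "\x{0301}" ],
    [ "single quote",           "'" ],
);

my %is_word;
for my $probe (@probes) {
    my ($label, $char) = @$probe;
    # Record whether the single character is matched by \w.
    $is_word{$label} = ($char =~ /^\w$/) ? 1 : 0;
    printf "%-24s %s\n", $label, $is_word{$label} ? '\w' : '\W';
}
```

Only the single quote should end up on the \W side.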


Re^2: perlretut - Perl regular expressions tutorial curveball by afoken (Chancellor) on Apr 11, 2025 at 08:43 UTC
> How to find documentation
>
> perlretut is a tutorial, you need to lookup details in perlre which is the central reference.

I just thought: This should be right at the top of perlretut, and a link to perlretut should be right at the top of perlre! I was quite sure it was not. So I quickly looked at the two documents before posting nonsense. And to my surprise, these two links are already there; each is the very first link under "description".

Great job!

Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^2: perlretut - Perl regular expressions tutorial curveball by LanX (Saint) on Apr 11, 2025 at 12:49 UTC
I somehow agree that the example and explanation ("whole string", "language processing") in the tutorial are not well chosen.

An example showing how to split ...

"I don't think 'don't' isn't a word"

... into individual words might be better.

That's actually still not that easy; Perl complying with Unicode standards alone doesn't make it trivial.

Edit

The best I could come up with is to split on word boundaries and to discard punctuation and whitespace with a grep.

  DB<13> $str = "I don't think 'don't' isn't a word"
  DB<14> x split /\b{wb}/, $str
0  'I'
1  ' '
2  'don\'t'
3  ' '
4  'think'
5  ' '
6  '\''
7  'don\'t'
8  '\''
9  ' '
10  'isn\'t'
11  ' '
12  'a'
13  ' '
14  'word'

Update

Like... ($str expanded with more edge cases)

  DB<27> $str = "I don't think, 'don't' isn't a word..."
  DB<28> x @list= split /\b{wb}/, $str
0  'I'
1  ' '
2  'don\'t'
3  ' '
4  'think'
5  ','
6  ' '
7  '\''
8  'don\'t'
9  '\''
10  ' '
11  'isn\'t'
12  ' '
13  'a'
14  ' '
15  'word'
16  '.'
17  '.'
18  '.'
  DB<29> x grep { not /^\W|\s+$/ } @list
0  'I'
1  'don\'t'
2  'think'
3  'don\'t'
4  'isn\'t'
5  'a'
6  'word'

Update

FWIW `grep { ! /^\W+$/ }` yields the same result, but I'm not convinced the example already covers all edge cases...

Update 2025-04-13

FWIW "Francis' car" is an example of what would still fail. The apostrophe will not be part of the first word after splitting.

Admittedly a tough problem.

Cheers Rolf
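
For anyone who wants to try this outside the debugger, here is a sketch of the same split-then-grep idea as a script. The helper name words_of is hypothetical (it does not exist in any module), and a perl of at least 5.22 is assumed:

```perl
use strict;
use warnings;
use v5.22;    # needed for \b{wb}

# Hypothetical helper: split on Unicode word boundaries, then drop
# every piece that consists only of non-word characters.
sub words_of {
    my ($text) = @_;
    return grep { !/^\W+$/ } split /\b{wb}/, $text;
}

my @words = words_of("I don't think, 'don't' isn't a word...");
print join('|', @words), "\n";

# Known weak spot from the thread: the trailing apostrophe of a
# possessive is split off and then discarded.
my @possessive = words_of("Francis' car");
print join('|', @possessive), "\n";
```

The first call should reproduce the seven words from the debugger session, and the second shows the possessive case losing its apostrophe. This is only an illustration of the approach in this thread, not a complete tokenizer.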
