perlretut - Perl regular expressions tutorial curveball

Cow1337killr has asked for the wisdom of the Perl Monks concerning the following question:

Even though I am pretty good at Perl regex, I am learning Perl regex so I can understand the Perl Monks threads on the subject.

I found https://perldoc.perl.org/perlretut entitled perlretut - Perl regular expressions tutorial.

Everything was fine until I came upon this passage in https://perldoc.perl.org/perlretut#Using-character-classes.

(This is a tutorial. It probably belongs in a footnote.)

For natural language processing (so that, for example, apostrophes are included in words), use instead \b{wb}

"don't" =~ / .+? \b{wb} /x;  # matches the whole string

What is going on here?

Regex101 is no help.

I wrote a test to try to understand it. It only caused me more confusion.

use warnings;
use strict;
use feature qw{ say };

if ("don't" =~ / (.+?) (\b{wb}) /x) {  # matches the whole string
    print "It matches\n";
    say $1;
    say $2;
}
else {
    print "It doesn't match\n";
}

if ("don't" =~ / (.+?) /x) {  # It no longer matches the whole string
    print "It matches\n";
    say $1;
}
else {
    print "It doesn't match\n";
}

Output:
It matches
don't

It matches
d

Who is going to attempt natural language processing with a couple of lines of Perl regex in 2025?


Replies are listed 'Best First'.
Re: perlretut - Perl regular expressions tutorial curveball by LanX (Saint) on Apr 11, 2025 at 02:31 UTC
\b is the standard word boundary, matching at any change from \w to \W and vice versa.

But a single quote `"'"` is not an alphanumeric character² in the class \w; it is in the opposing class \W! Hence with plain \b only "don" would be matched.¹

(Actually, to be more precise, "don't" should be written with an apostrophe, not a quote, but yeah, computers, you know ;)

5.22 introduced `\b{wb}` to "process" characters which appear inside words of "natural languages" like English according to Unicode rules.

Now "don't" can be matched.

Hope it's clearer now.

How to find documentation

`perlretut` is a tutorial; you need to look up details in `perlre`, which is the central reference.

After searching for "wb" there, you'll be directed to `perlrebackslash`, which details backslash sequences.

In this case a discussion is found in "\b{}, \b, \B{}, \B", with explicit details in

> To get better word matching of natural language text, see "\b{wb}" below.

Cheers Rolf
(addicted to the Perl Programming Language :)
see Wikisyntax for the Monastery

Updates

¹)
DB<35> x "don't" =~ / (.+?) (\b) /x
0  'don'
1  ''
DB<36> x "don't" =~ / (.+?) (\b) /xg
0  'don'
1  ''      # boundaries always empty
2  '\''
3  ''
4  't'
5  ''

²) Actually, this is also more complicated ...

\w [3] Match a "word" character (alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks)
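The contrast described in this reply can be reproduced outside the debugger. The following is a minimal sketch, assuming Perl 5.22 or later; the variable names `$classic` and `$unicode` are illustrative and do not come from the thread.

```perl
use v5.22;      # \b{wb} is only available from Perl 5.22 on
use warnings;

my $word = "don't";

# Classic \b: a boundary wherever \w meets \W, so the lazy match
# stops in front of the apostrophe.
my ($classic) = $word =~ /(.+?)\b/;

# Unicode word boundary: an apostrophe between letters counts as
# word-internal, so the first boundary after "d" is the end of the string.
my ($unicode) = $word =~ /(.+?)\b{wb}/;

say '\b     captures: ', $classic;   # don
say '\b{wb} captures: ', $unicode;   # don't
```

The only difference between the two patterns is the boundary assertion; the lazy `.+?` behaves the same in both and simply stops at the first boundary it is offered.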
Re^2: perlretut - Perl regular expressions tutorial curveball by afoken (Chancellor) on Apr 11, 2025 at 08:43 UTC
> How to find documentation
> perlretut is a tutorial, you need to lookup details in perlre which is the central reference.

I just thought: This should be right at the top of perlretut, and a link to perlretut should be right at the top of perlre! I was quite sure it was not. So I quickly looked at the two documents before posting nonsense. And to my surprise, these two links are already there; each is the very first link under "description".

Great job!

Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^2: perlretut - Perl regular expressions tutorial curveball by LanX (Saint) on Apr 11, 2025 at 12:49 UTC
I somewhat agree that the example and explanation ("whole string", "language processing") in the tutorial are not well chosen.

An example showing how to split ...

`"I don't think 'don't' isn't a word"`

... into individual words might be better.

That's actually still not that easy; Perl complying with Unicode standards doesn't by itself make it trivial.

Edit

The best I could come up with is to split on word boundaries and to discard punctuation and whitespace in a grep.

DB<13> $str = "I don't think 'don't' isn't a word"
DB<14> x split /\b{wb}/, $str
0  'I'
1  ' '
2  'don\'t'
3  ' '
4  'think'
5  ' '
6  '\''
7  'don\'t'
8  '\''
9  ' '
10  'isn\'t'
11  ' '
12  'a'
13  ' '
14  'word'

Update

Like... ($str expanded with more edge cases)

DB<27> $str = "I don't think, 'don't' isn't a word..."
DB<28> x @list= split /\b{wb}/, $str
0  'I'
1  ' '
2  'don\'t'
3  ' '
4  'think'
5  ','
6  ' '
7  '\''
8  'don\'t'
9  '\''
10  ' '
11  'isn\'t'
12  ' '
13  'a'
14  ' '
15  'word'
16  '.'
17  '.'
18  '.'
DB<29> x grep { not /^\W|\s+$/ } @list
0  'I'
1  'don\'t'
2  'think'
3  'don\'t'
4  'isn\'t'
5  'a'
6  'word'

Update

FWIW `grep { ! /^\W+$/ }` yields the same result, but I'm not convinced the example already covers all edge cases...

Update 2025-04-13

FWIW `"Francis' car"` is an example of what would still fail. The apostrophe will not be part of the first word after splitting.

Admittedly a tough problem.

Cheers Rolf
(addicted to the Perl Programming Language :)
see Wikisyntax for the Monastery
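The split-and-grep approach from this reply can be wrapped into a small script. This is only a sketch under the same assumptions (Perl 5.22 or later); the helper name `words` is hypothetical, and the filter is the `/^\W+$/` variant mentioned in the update.

```perl
use v5.22;      # \b{wb} is only available from Perl 5.22 on
use warnings;

# Hypothetical helper: split on Unicode word boundaries, then drop
# every piece that consists only of non-word characters
# (whitespace, quotes, punctuation).
sub words {
    my ($text) = @_;
    return grep { !/^\W+$/ } split /\b{wb}/, $text;
}

my @words   = words("I don't think, 'don't' isn't a word...");
my @francis = words("Francis' car");

say join '|', @words;     # I|don't|think|don't|isn't|a|word
say join '|', @francis;   # Francis|car  (the trailing apostrophe is lost)
```

The second call shows the limitation noted in the 2025-04-13 update: a possessive apostrophe at the end of a word is split off and then filtered out as punctuation.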
Re: perlretut - Perl regular expressions tutorial curveball by hippo (Archbishop) on Apr 10, 2025 at 21:28 UTC
`"don't" =~ / .+? \b{wb} /x; # matches the whole string`

The regex matches from the first character up to the first following word boundary (which is just after the "t"), so it matches the entire word, and since there is only one word, that's the entire string in this case. Without the `\b{wb}` it just matches one or more characters non-greedily, which is just the first "d".

> Who is going to attempt natural language processing with a couple of lines of Perl regex in 2025?

For a large corpus? Nobody. For one sentence? Probably me. :-P

Bear in mind also that this example has probably been in perlretut for many, many years.

🦛
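One way to check where the boundaries actually fall is to list the offsets at which each assertion matches. The sketch below assumes Perl 5.22 or later; `boundary_offsets` is a hypothetical helper written for this illustration, not a core function.

```perl
use v5.22;      # \b{wb} is only available from Perl 5.22 on
use warnings;

# Hypothetical helper: return every offset in $str at which the
# zero-width pattern $re matches.
sub boundary_offsets {
    my ($str, $re) = @_;
    my @offsets;
    while ($str =~ /$re/g) {
        push @offsets, pos($str);
    }
    return @offsets;
}

my @classic = boundary_offsets("don't", qr/\b/);
my @unicode = boundary_offsets("don't", qr/\b{wb}/);

say '\b     offsets: ', join(' ', @classic);   # 0 3 4 5
say '\b{wb} offsets: ', join(' ', @unicode);   # 0 5
```

Because `.+?` must consume at least one character, the boundary at offset 0 cannot end the match. The first usable boundary is offset 3 with `\b` (giving "don") and offset 5 with `\b{wb}` (giving the whole string).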
Re^2: perlretut - Perl regular expressions tutorial curveball by choroba (Cardinal) on Apr 10, 2025 at 21:33 UTC
> Bear in mind also that this example has probably been in perlretut for many, many years.

The new boundaries were introduced in 5.22, which was released in 2015. So "many, many" = 10.

`map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
