Cow1337killr has asked for the wisdom of the Perl Monks concerning the following question:

Even though I am pretty good at Perl regex, I am learning Perl regex so I can understand the Perl Monks threads on the subject.

I found https://perldoc.perl.org/perlretut entitled perlretut - Perl regular expressions tutorial.

Everything was fine until I came upon this passage in https://perldoc.perl.org/perlretut#Using-character-classes.

(This is a tutorial. It probably belongs in a footnote.)

For natural language processing (so that, for example, apostrophes are included in words), use instead \b{wb}

    "don't" =~ / .+? \b{wb} /x;  # matches the whole string

What is going on here?

Regex101 is no help.

I wrote a test to try to understand it. It only caused me more confusion.

    use warnings;
    use strict;
    use feature qw{ say };

    if ("don't" =~ / (.+?) (\b{wb}) /x) {  # matches the whole string
        print "It matches\n";
        say $1;
        say $2;
    } else {
        print "It doesn't match\n";
    }

    if ("don't" =~ / (.+?) /x) {  # It no longer matches the whole string
        print "It matches\n";
        say $1;
    } else {
        print "It doesn't match\n";
    }

Output:

    It matches
    don't

    It matches
    d

Who is going to attempt natural language processing with a couple of lines of Perl regex in 2025?

Re: perlretut - Perl regular expressions tutorial curveball
by LanX (Saint) on Apr 11, 2025 at 02:31 UTC
    \b is the standard word boundary: it matches at any transition from \w to \W or vice versa.

    But a single quote "'" is not an alphanumeric character² in the class \w; it belongs to the opposing class \W!

    Hence with a plain \b only "don" would be matched in the example. ¹

    (Actually, to be more precise "don't" should be written with an apostrophe not a quote, but yeah computers you know ;)

    Perl 5.22 introduced \b{wb}, which follows Unicode rules to handle characters that appear inside words of "natural languages" like English.

    Now "don't" can be matched.

    Hope it's clearer now.
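A minimal sketch of the difference described above, assuming Perl 5.22 or later. The variable names are only illustrative; the two patterns are the ones discussed in this thread.

```perl
use strict;
use warnings;
use v5.22;    # \b{wb} needs Perl 5.22 or later

my $word = "don't";

# Plain \b: the apostrophe is \W, so a boundary sits between "n" and "'".
my ($plain) = $word =~ / (.+?) \b /x;

# \b{wb}: Unicode word-boundary rules keep the apostrophe inside the word.
my ($unicode) = $word =~ / (.+?) \b{wb} /x;

say "plain \\b: $plain";      # don
say "\\b{wb}:   $unicode";    # don't
```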

    How to find documentation

    perlretut is a tutorial; you need to look up details in perlre, which is the central reference. Searching for "wb" there points you to perlrebackslash, which details the backslash sequences.

    In this case a discussion is found in \b{}, \b, \B{}, \B

    With explicit details in

    > To get better word matching of natural language text, see "\b{wb}" below.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

    Updates

    ¹)

    DB<35> x "don't" =~ / (.+?) (\b) /x
    0  'don'
    1  ''
    DB<36> x "don't" =~ / (.+?) (\b) /xg
    0  'don'
    1  ''      # boundaries always empty
    2  '\''
    3  ''
    4  't'
    5  ''

    ²) actually this is also more complicated ...

    \w [3] Match a "word" character (alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks)
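A small sketch to check which characters fall into \w, following the definition quoted above. The helper name is_word_char and the sample characters are my own choices for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the single character matches \w.
sub is_word_char {
    my ($char) = @_;
    return $char =~ /^\w$/ ? 1 : 0;
}

# Each entry: a label and a one-character string to test.
my @samples = (
    [ 'apostrophe'             => "'"        ],
    [ 'underscore'             => "_"        ],
    [ 'digit'                  => "7"        ],
    [ 'undertie (connector)'   => "\x{203F}" ],
    [ 'combining acute (mark)' => "\x{0301}" ],
);

for my $sample (@samples) {
    my ($label, $char) = @$sample;
    printf "%-24s %s\n", $label, is_word_char($char) ? '\w' : '\W';
}
```

Only the apostrophe should come out as \W here, which is why a plain \b sees a boundary inside "don't".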

      How to find documentation

      perlretut is a tutorial; you need to look up details in perlre, which is the central reference.

      I just thought: This should be right at the top of perlretut, and a link to perlretut should be right at the top of perlre! I was quite sure it was not. So I quickly looked at the two documents before posting nonsense. And to my surprise, these two links are already there; each is the very first link under "DESCRIPTION". Great job!

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      I somewhat agree that the example and explanation ("whole string", "language processing") in the tutorial are not well chosen.

      An example showing how to split ...

      • "I don't think 'don't' isn't a word"
      ... into individual words might be better.

      That's actually still not that easy. Perl complying with the Unicode standard doesn't by itself make it trivial.

      Edit
      The best I could come up with is to split on word boundaries and to discard punctuation and whitespace in a grep.

      DB<13> $str = "I don't think 'don't' isn't a word"
      DB<14> x split /\b{wb}/, $str
      0  'I'
      1  ' '
      2  'don\'t'
      3  ' '
      4  'think'
      5  ' '
      6  '\''
      7  'don\'t'
      8  '\''
      9  ' '
      10  'isn\'t'
      11  ' '
      12  'a'
      13  ' '
      14  'word'

      Update

      Like... ($str expanded with more edge cases)

      DB<27> $str = "I don't think, 'don't' isn't a word..."
      DB<28> x @list= split /\b{wb}/, $str
      0  'I'
      1  ' '
      2  'don\'t'
      3  ' '
      4  'think'
      5  ','
      6  ' '
      7  '\''
      8  'don\'t'
      9  '\''
      10  ' '
      11  'isn\'t'
      12  ' '
      13  'a'
      14  ' '
      15  'word'
      16  '.'
      17  '.'
      18  '.'
      DB<29> x grep { not /^\W|\s+$/ } @list
      0  'I'
      1  'don\'t'
      2  'think'
      3  'don\'t'
      4  'isn\'t'
      5  'a'
      6  'word'

      Update

      FWIW grep { ! /^\W+$/ } yields the same result, but I'm not convinced the example already covers all edge cases...
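The debugger session above can be sketched as a standalone script, assuming Perl 5.22 or later. The variable names @pieces and @words are only illustrative; the split and grep are the ones shown in this thread.

```perl
use strict;
use warnings;
use v5.22;    # \b{wb} needs Perl 5.22 or later

my $str = "I don't think, 'don't' isn't a word...";

# Split at every Unicode word boundary; punctuation and whitespace
# come out as separate pieces.
my @pieces = split /\b{wb}/, $str;

# Keep only pieces that are not made up entirely of non-word characters.
my @words = grep { !/^\W+$/ } @pieces;

say join '|', @words;    # I|don't|think|don't|isn't|a|word
```

This is not a full tokenizer. It only shows that the boundary rules keep word-internal apostrophes attached.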

      Update 2025-04-13

      FWIW

      "Francis' car" is an example of what would still fail. The apostrophe will not be part of the first word after splitting. Admittedly a tough problem.

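A short sketch of that failing case, using the same split-and-grep approach and assuming Perl 5.22 or later. The variable names are illustrative only.

```perl
use strict;
use warnings;
use v5.22;    # \b{wb} needs Perl 5.22 or later

my $str = "Francis' car";

# The trailing apostrophe is followed by a space, not a letter, so the
# word-boundary rules do not treat it as part of "Francis".
my @pieces = split /\b{wb}/, $str;
my @words  = grep { !/^\W+$/ } @pieces;

say join '|', @words;    # Francis|car
```

The possessive apostrophe is lost because nothing in the boundary rules can tell it apart from a closing quote.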
      Cheers Rolf

Re: perlretut - Perl regular expressions tutorial curveball
by hippo (Archbishop) on Apr 10, 2025 at 21:28 UTC
    "don't" =~ / .+? \b{wb} /x;  # matches the whole string

    The regex matches from the first character up to the first following word boundary (which is just after the "t"), so it matches the entire word. Since there is only one word, that's the entire string in this case. Without the \b{wb} it just matches one or more characters non-greedily, which is just the first "d".
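A minimal sketch of that behaviour, assuming Perl 5.22 or later. The helpers lazy_match and bounded_match and the second test string "don't panic" are my own additions for illustration.

```perl
use strict;
use warnings;
use v5.22;    # \b{wb} needs Perl 5.22 or later

# Hypothetical helper: lazy match with nothing after it.
sub lazy_match {
    my ($text) = @_;
    my ($match) = $text =~ / (.+?) /x;
    return $match;
}

# Hypothetical helper: lazy match that must end at a Unicode word boundary.
sub bounded_match {
    my ($text) = @_;
    my ($match) = $text =~ / (.+?) \b{wb} /x;
    return $match;
}

for my $text ("don't", "don't panic") {
    printf "%-12s lazy: %-3s bounded: %s\n",
        $text, lazy_match($text), bounded_match($text);
}
```

With two words in the string, the bounded match stops after the first word, which shows that "matches the whole string" in the tutorial only holds because the string is a single word.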

    Who is going to attempt natural language processing with a couple of lines of Perl regex in 2025?

    For a large corpus? Nobody. For one sentence? Probably me. :-P

    Bear in mind also that this example has probably been in perlretut for many, many years.


    🦛

      > Bear in mind also that this example has probably been in perlretut for many, many years.

      The new boundaries were introduced in 5.22 which was released in 2015. So "many, many" = 10.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]