
About \d \w and \s

by demerphq (Chancellor)
on Oct 18, 2009 at 16:02 UTC ( #801876=perlmeditation )

I am currently working on fixing some problems with the current rules for what \d \s and \w should match. It turns out that the current definition/rules lead to logical inconsistencies in the regex engine which cannot be resolved without changing the definitions, and thus breaking something out there.

Unfortunately, however, the current behaviour is really close to what people expect: almost all of the time the rules DWIM nicely. It is only on edge cases and certain consistency checks that things fall down. This means that any "fixing" of the default rules causes a lot of stuff to break, which in turn means that we have to deal with this by adding new modifier flags to control things and leave the defaults pretty much alone.

I am currently working on adding the following set of mutually exclusive flags and behaviour.

Modifier  Semantics      \w                \s            \d
/u        Unicode        \p{IsWord}        \p{IsSpace}   [0-9]
/a        ASCII/Perl     [A-Za-z0-9_]      [ \t\r\n]     [0-9]
/b        Broken/Legacy  same as perl 5.8                [0-9]
/l        "use locale"   same semantics as under use locale in 5.8.x

Most of this is pretty much a given. The main question is \d under the /b modifier (which will likely be the default). I think it makes a lot of sense to change the default of \d to match only the "computing digits" [0-9] and not "any digit in Unicode". I think it is likely to fix more things than it will break. For those of you out there working in non-English/non-Latin scripts: how much do you depend on \d matching your native digits?

Relevant links: Regarding the new \w regexp escape in 5.11


Replies are listed 'Best First'.
Re: About \d \w and \s
by Corion (Patriarch) on Oct 18, 2009 at 16:14 UTC

    Personally, I've avoided relying on unicode/charset semantics with regular expressions. Most of the input I deal with is either Latin-1 or some other "near IBM-ASCII" single byte encoding, and so is my source code. I've made my regular expressions lenient in the sense that I use dots where I expect umlauts.

    Of course, if I were stricter about the encodings of my input data, or Perl were smarter about guessing the encoding of my input data (which is hard without carrying a dictionary of likenesses), I could write my source code and my regular expressions in Unicode, and then it would be cool if \w used the Unicode semantics.

    I have no opinion on \d, as German has only 0-9 as digits anyway, and so does my input data.

      I believe that /u would provide sane matching for German or other latin-1 languages as it would make perl match according to the unicode rules even when the string/pattern weren't themselves unicode.


Re: About \d \w and \s
by xdg (Monsignor) on Oct 18, 2009 at 17:01 UTC

    My quick reaction is that the default should be /a and people should have to explicitly request /u.

    To the extent it breaks existing code, I think it's in the interest of ensuring that, going forward, there are the fewest surprises for new code being written. If there's a way to "use re 'legacy_unicode'" so people can drop that at the top of their code instead of fixing every individual regex, that seems like an easy enough way to offer backward compatibility.

    But I firmly believe the future default should not be the legacy behavior.


    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

      My quick reaction is that the default should be /a...
      If there's a way to "use re 'legacy_unicode'"...
      But I firmly believe the future default should not be the legacy behavior.

      !? (/me wondering about --)

      There are enough perl -e one-line filters in shell scripts that even requiring an additional 'use' statement, merely to retain the originally intended behaviour, would be a daunting and painful task. So I'd say stick to the ASCII semantics by default, but allow modifiers and use-clauses (file-only scope!?) to modify the behaviour.

      With Unicode proper it seems to be useful to add a single new \w escape that includes latin1 umlauts. Similar arguments can be made for extended whitespace escapes.

      But basically it's impossible to add enough escapes for all kinds of sensible subsets of digit-class characters, whitespace, word characters or alphanumerical characters: the only sane way to cope is to switch to explicit character classes early in our regexes.

      And we probably shouldn't even consider wasting an unused character for a new alternative \w-style escape adding latin1-umlauts (says a German, who is using those umlauts all the time).

      adding another 0.01 and staying firmly in the ASCII semantics camp,

Re: About \d \w and \s
by jwkrahn (Monsignor) on Oct 19, 2009 at 01:54 UTC
    /a ASCII/Perl [A-Za-z0-9_] [ \t\r\n] [0-9]

    That should be:

    /a ASCII/Perl [A-Za-z0-9_] [ \t\f\r\n] [0-9]

    \f is also in the \s character class.

      Good catch.


Re: About \d \w and \s
by moritz (Cardinal) on Oct 18, 2009 at 19:17 UTC
    I like the ideas of having /a, /u and /l modifiers. And while I have nothing against the fourth one, I question its value.

    The 5.8 behavior is rather confusing, and I remember a lengthy discussion with ikegami where we tried to figure out the details, and it took us quite some time. Also it seemed to have changed between different maintenance releases of 5.8 (not quite sure on this one), so I think that nearly nobody relies willingly on these semantics.

    I'd also love to see a way to set the default in a pragma. I envision a pragma that enables as much Unicode behavior as possible, something like use utf8::all; or so, which imports, use open ':encoding(utf8)'; and the pragma that enables /u by default.

Re: About \d \w and \s
by mirod (Canon) on Oct 18, 2009 at 16:17 UTC

    I think you're right, \d should be strictly equivalent to [0-9]. That's the way it worked pre-unicode, and I suspect a lot of code still uses it this way. The author would be quite surprised to see that their regexp actually matches non-traditional digits, and it could be a potential security problem.

    I don't really like the /b, for broken, modifier. Maybe /t (traditional?) or /c (classical) if they aren't already used (I don't believe they are).

      Unfortunately /c is taken for /gc matches. My problem with "traditional" is that "traditional" is how I have been thinking of the /a variant, which makes things match the way perl did before it supported unicode. But maybe the difference is that one is "Perl-traditional" and the other is "Perl-with-unicode-traditional". I don't know. Got any other ideas?

      BTW you can see the ones that are taken below:

      /* chars and strings used as regex pattern modifiers
       * Singular is a 'c'har, plural is a "string"
       *
       * NOTE, KEEPCOPY was originally 'k', but was changed to 'p' for preserve
       * for compatibility reasons with Regexp::Common which highjacked (?k:...)
       * for its own uses. So 'k' is out as well.
       */
      #define EXEC_PAT_MOD          'e'
      #define KEEPCOPY_PAT_MOD      'p'
      #define ONCE_PAT_MOD          'o'
      #define GLOBAL_PAT_MOD        'g'
      #define CONTINUE_PAT_MOD      'c'
      #define MULTILINE_PAT_MOD     'm'
      #define SINGLE_PAT_MOD        's'
      #define IGNORE_PAT_MOD        'i'
      #define XTENDED_PAT_MOD       'x'
      #define BROKEN_SEM_PAT_MOD    'b'
      #define LOCALE_SEM_PAT_MOD    'l'
      #define PERL_SEM_PAT_MOD      'a'
      #define UNI_SEM_PAT_MOD       'u'

      #define ONCE_PAT_MODS         "o"
      #define KEEPCOPY_PAT_MODS     "p"
      #define EXEC_PAT_MODS         "e"
      #define LOOP_PAT_MODS         "gc"
      #define STD_PAT_MODS          "msix"
      #define SEM_PAT_MODS          "blau"

      #define INT_PAT_MODS  STD_PAT_MODS  KEEPCOPY_PAT_MODS
      #define EXT_PAT_MODS  ONCE_PAT_MODS KEEPCOPY_PAT_MODS
      #define QR_PAT_MODS   STD_PAT_MODS  EXT_PAT_MODS SEM_PAT_MODS
      #define M_PAT_MODS    QR_PAT_MODS   LOOP_PAT_MODS
      #define S_PAT_MODS    M_PAT_MODS    EXEC_PAT_MODS

        Since legacy is the default, I'd expect the flag to be explicitly named only rarely. Use "L" for legacy. If you can't stand the use of the shift key (even rarely), perhaps "h" for "historical."

        (I agree with mirod, the old behavior is not "broken.")

        Update: Maybe describing what it does is too hard. Just call it "d" for "default"!

      I changed it to /t for "traditional" in the source now.


Re: About \d \w and \s
by graff (Chancellor) on Oct 19, 2009 at 06:43 UTC
    For you out there working in non-english/latin how much do you depend on \d matching your native digits?

    I suspect I've done this a few times at least, w.r.t the "Arabic-Indic" digits (U+0660 - U+0669 and even U+06F0 - U+06F9); some Arabic typists have a pernicious tendency to use these as well as ASCII digits in a single document.

    Folks working in Chinese/Japanese/Korean often see the "full-width digits" (U+FF10 - U+FF19) -- and these can also show up in the same document with ASCII digits.

    I'm all for greater consistency in regex semantics, but in that regard, it strikes me as very odd (and probably unfortunate) that /\d/u would not be equivalent to /\p{IsDigit}/, in contrast to what /u does for the \s and \w escapes. (What about "\b", by the way?)

    I think it's usually the case, when doing regex matching on non-Latin text, that the primary task is to segregate text into functional (linguistic) categories: word strings vs. digit strings vs. punctuation strings vs... Once that's done, we might want to do different things with the different chunks (like normalizing digit strings).

    If I happen to be working with non-Latin data that uses mixed digits, I think I'd rather error out on finding that some "/\d+/u" strings are not suitable for doing arithmetic, rather than never finding out that I'm missing the non-ASCII digit strings altogether because they didn't match "/\d+/u".

    If there's no "/u" modifier, and I always have to use perlunicode escapes in regexes in order to match unicode character class equivalents of \s \w \d, okay fine, I'll use \pZ [\pL\pM] \pN (or \p{IsSpace} \p{IsWord} \p{IsDigit} if I sense a need to code verbosely).

    But if there's going to be a "/u" modifier, I think it would be more consistent (less surprising/annoying) to have it treat \d the same as \w and \s (and \b, for that matter), especially since "\d" is normally understood to be a subset of \w, and with a /u modifier, \w would include non-ASCII digits.

    If someone has to face the task of doing arithmetic on potentially mixed digit strings, it won't be long before we have a CPAN module for this (maybe there's one already?), and testing for non-ASCII digits would be pretty simple:

    if ( /^\d+$/u and not /^\d+$/ ) {
        # need to normalize this non-ASCII digit string before doing arithmetic...
    }
Re: About \d \w and \s
by JavaFan (Canon) on Oct 18, 2009 at 20:12 UTC
    I'm really surprised that \d under the Unicode semantics is just [0-9]. I would expect that if \w is \p{IsWord}, and \s is \p{IsSpace}, that then \d is \p{IsDigit}.

    And I don't get /b at all. I would expect that if I use /b because I have a regexp that I want to behave exactly as in 5.8, I really want that. But /b isn't giving me that, as \d matches a lot more than [0-9] in 5.8. I appreciate changing the \d to [0-9], that gets my vote, but I don't get a broken "broken/legacy" switch.

    Third, what's the deal with /a and \s? Currently, \s matches 5 ASCII code points, but under /a it's going to match four?

Re: About \d \w and \s
by ambrus (Abbot) on Oct 19, 2009 at 08:59 UTC

    To clarify things: does the unicode variant treat byte strings as if they were iso-8859-1 encoded? (There's also the question of how the use locale variant treats character strings. Currently it assumes the string was accidentally iso-8859-1 decoded, except where it has characters with a code higher than 255; but it's probably always an error to actually depend on this, so it doesn't matter.)

    Strangely, it seems I don't have any obfus that use syntax like m/foobar/and (the closest I have is y//or in Ode for getprotobyname) so for a change this will be a new feature of perl core that does not break any of my obfus.

      If I remember what iso-8859-1 is, then I think so, yes. In simple terms, the rules will be those of Unicode even though the representation of the codepoints is bytes. In other words, the matching would behave the same as it would if you did a utf8::upgrade() on the string before the match.

      How the regex engine works under use locale will not be changed, except that it won't be "all or nothing": you will be able to turn it on for sections of a pattern. I don't pretend to understand the use locale mode and I don't plan to do much with it. (I'd like it if use locale "went away", actually.)


Re: About \d \w and \s
by Anonymous Monk on Oct 18, 2009 at 16:50 UTC

    The /b behaviour is inconsistent, and I don't think we should be supporting it indefinitely. We should abandon it entirely. The /a, /u, and /l behaviours are all sane, and if the meaning of \w et al. is going to be variable, then modifiers are the right way to do it.

    As for what should be the default: apparently the rationale for having variable meaning is that it's too much trouble to change all the code that expects Unicode behaviour, so the only sensible default in that case is /u. I suppose this means we're resigned to amending all the existing code that wants ASCII behaviour, which is funny, because I thought there was a lot more code in the latter category than the former. If changing Unicode-based code is *not* too onerous, I'd favour /a being the default, and possibly not doing the modifier thing at all.

Node Type: perlmeditation [id://801876]
Approved by Corion
Front-paged by Corion