About \d \w and \s

I am currently working on fixing some problems with the rules for what \d, \s and \w should match. It turns out that the current definitions lead to logical inconsistencies in the regex engine which cannot be resolved without changing the definitions, and thus breaking something out there.

Unfortunately, however, the current behaviour is really close to what people expect: almost all of the time the rules DWIM nicely. It is only on edge cases and certain consistency checks that things fall down. This means that any "fixing" of the default rules causes a lot of stuff to break, which in turn means that we have to make do by adding new modifier flags to control things and pretty much leave the defaults alone.
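To make "edge case" concrete, here is a minimal sketch of one well-known inconsistency of this kind. It is an assumption on my part that this is among the cases meant above. It also assumes a perl running without "use locale" and without the unicode_strings feature. The helper name is_word_char is made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a one-character string matches \w.
sub is_word_char {
    my ($str) = @_;
    return $str =~ /\A\w\z/ ? 1 : 0;
}

# U+00E9 (e with acute accent) stored as a single native byte.
my $native = "\xE9";

# The same character after switching to Perl's internal UTF-8 storage.
my $upgraded = $native;
utf8::upgrade($upgraded);

# The two strings compare equal, yet \w treats them differently.
printf "strings equal:  %s\n", $native eq $upgraded ? "yes" : "no";
printf "native   =~ \\w: %d\n", is_word_char($native);
printf "upgraded =~ \\w: %d\n", is_word_char($upgraded);
```

Two strings that are "eq" to each other but disagree about \w is exactly the sort of thing that cannot be fixed without changing what somebody's existing code sees.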

I am currently working on adding the following set of mutually exclusive flags and behaviour.

Modifier       Semantics             \w               \s             \d
   /u          Unicode               \p{IsWord}       \p{IsSpace}    [0-9]
   /a          ASCII/Perl            [A-Za-z0-9_]     [ \t\r\n]      [0-9]
   /b          Broken/Legacy         same as perl 5.8                [0-9]
   /l          "use locale"          same semantics as under use locale in 5.8.x

Most of this is pretty much a given. The main question is \d under the /b modifier (which will likely be the default). I think it makes a lot of sense to change the default of \d to match only the "computing digits" [0-9] and not "any digit in Unicode". I think it is likely to fix more things than it will break. For those of you out there working in non-English/non-Latin scripts: how much do you depend on \d matching your native digits?
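For anyone unsure whether their code depends on this, here is a small sketch that contrasts \d with an explicit [0-9] class on strings containing non-ASCII digits. The helper names and sample strings are made up for the illustration. It shows the behaviour of \d with no modifier on Unicode strings, not the proposed flags.

```perl
use strict;
use warnings;

# Hypothetical helpers: does the whole string consist of digits?
sub matches_backslash_d { my ($s) = @_; return $s =~ /\A\d+\z/     ? 1 : 0 }
sub matches_ascii_class { my ($s) = @_; return $s =~ /\A[0-9]+\z/ ? 1 : 0 }

# The number 2009 written with three different sets of digits.
my %samples = (
    ascii        => "2009",
    arabic_indic => "\x{0662}\x{0660}\x{0660}\x{0669}",
    devanagari   => "\x{0968}\x{0966}\x{0966}\x{096F}",
);

# Print only the sample names, so no wide characters reach STDOUT.
for my $name (sort keys %samples) {
    printf "%-13s \\d: %d   [0-9]: %d\n",
        $name,
        matches_backslash_d($samples{$name}),
        matches_ascii_class($samples{$name});
}
```

Code that feeds the result of a \d match straight into arithmetic is the case where "any digit in Unicode" tends to surprise people, since Perl's numeric conversion only understands the ASCII digits.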

Relevant links: Regarding the new \w regexp escape in 5.11

---
$world=~s/war/peace/g
