in reply to Re^4: Regex to remove generic accounts
in thread Regex to remove generic accounts

There are more than 100 characters that match /\d/ in 5.10

Does this mean that digits from other languages are also considered as 'digit' by \d? For example, if I have a string consisting of Japanese kanji, would \d match the Kanji digits too?

-- 
Ronald Fischer <ynnor@mm.st>

Re^6: Regex to remove generic accounts
by JavaFan (Canon) on Oct 28, 2008 at 10:26 UTC
    Yes and no. Digits from other languages are matched by \d, but not from every language. I think, but I haven't studied the Unicode property database in detail, that if the language uses a strict base-10 system, its digits are matched by \d. But the existence of a "tens" or "hundreds" symbol excludes all its digits from being matched by \d. And it may very well be that the database isn't consistent in this respect. I don't know what system Japanese uses, but AFAIK, Kanji digits aren't matched by \d.
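
    A quick way to check individual characters is to test them against \d directly. This is only a sketch: the sample code points and the helper name matches_digit are picked for illustration, and the results depend on the Unicode tables shipped with the perl that runs it.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the single character is matched by \d.
sub matches_digit { return $_[0] =~ /^\d$/ ? 1 : 0 }

# Illustrative samples: [ label, character ]
my @samples = (
    [ 'DIGIT ONE',              "\x{0031}" ],  # ASCII "1"
    [ 'ARABIC-INDIC DIGIT ONE', "\x{0661}" ],  # general category Nd
    [ 'DEVANAGARI DIGIT ONE',   "\x{0967}" ],  # general category Nd
    [ 'CJK ideograph "one"',    "\x{4E00}" ],  # general category Lo
    [ 'ROMAN NUMERAL ONE',      "\x{2160}" ],  # general category Nl
);

for my $sample (@samples) {
    my ($name, $char) = @$sample;
    printf "%-24s U+%04X  %s\n",
        $name, ord($char),
        matches_digit($char) ? 'matched by \d' : 'not matched by \d';
}
```

    The Arabic-Indic and Devanagari digits are in general category Nd, the kanji is a letter (Lo), and the Roman numeral is a letter number (Nl), so only the first three should be reported as matching.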

      Hmmm.... The Kanji for 1-9 (they use "our" 0 for denoting zero) can be used in two ways, one mimics exactly our positional base-10 system, the other one does not (it is easy to see from the way the number is written which of the two usages is being employed). So, if Kanji don't count for \d, can you give me other examples besides 0-9 which are considered digits? Maybe the Greek ordinal symbols? They are at least used in "base 10" fashion.

      -- 
      Ronald Fischer <ynnor@mm.st>
        perl -MConfig -aF';' -nE 'BEGIN {@ARGV = "$Config{privlib}/unicore/UnicodeData.txt"} say $F[1] if $F[2] eq "Nd"'
        This gives me 290 matches in 5.10.
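
        If UnicodeData.txt isn't available in your installation, roughly the same count can be obtained by asking the regex engine itself. A minimal sketch, assuming \d and general category Nd coincide in the perl being used; count_digit_chars is a made-up name, and the total will differ between perl versions because each one bundles a different Unicode release.

```perl
use strict;
use warnings;
no warnings 'utf8';   # older perls may warn about noncharacter code points

# Hypothetical helper: count code points in [$first, $last] matched by \d.
sub count_digit_chars {
    my ($first, $last) = @_;
    my $count = 0;
    for my $cp ($first .. $last) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;   # skip surrogates
        $count++ if chr($cp) =~ /^\d$/;
    }
    return $count;
}

# Scan the whole Unicode range; this takes a moment.
printf "%d code points match \\d in perl %s\n",
    count_digit_chars(0, 0x10FFFF), $];
```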