in reply to Remove unicode "whitespace"
I have a number of strings which terminate in the unicode character E2 80 8E.
There's no such thing as Unicode character E2 80 8E. Those three bytes are a UTF-8 encoding, i.e. a secret code, for some *Unicode integer* (a code point, which represents some character in some language). The code point here is U+200E, which represents the character LRM (LEFT-TO-RIGHT MARK).
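To make the bytes-versus-character distinction concrete, here is a minimal sketch using the core Encode module. The variable names are mine, and it assumes you are starting from the raw bytes rather than an already-decoded string.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three bytes from the question, as a raw byte string.
my $bytes = "\xE2\x80\x8E";

# Decoding UTF-8 turns those bytes into a one-character string.
my $char = decode('UTF-8', $bytes);

printf "length in bytes: %d\n", length $bytes;   # 3
printf "length in chars: %d\n", length $char;    # 1
printf "code point: U+%04X\n",  ord $char;       # U+200E

# Encoding goes the other way, from the character back to bytes.
my $round_trip = encode('UTF-8', "\x{200E}");
print "round trip ok\n" if $round_trip eq $bytes;
```

If your strings were never decoded on input, a regex sees three separate bytes instead of one LRM character, so it is worth checking that first.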
this conforms to my expectation of "whitespace", but of course, it doesn't match \s in a RE.
Unicode does not include LRM among the characters it considers whitespace (26 at the time of writing), so \s will not match it. Your challenge is going to be to pin down the category of characters that you want to strip off the end of your strings.
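A quick way to see which side of the line a character falls on is to test it against both \s and \p{Format}. This is only an illustrative sketch: the helper names are hypothetical, and EM SPACE (U+2003) is just a convenient example of a character Unicode does treat as whitespace.

```perl
use strict;
use warnings;

# Hypothetical helpers: return 1 or 0 for a one-character string.
sub is_space  { return $_[0] =~ /\s/         ? 1 : 0 }
sub is_format { return $_[0] =~ /\p{Format}/ ? 1 : 0 }

for my $case ( [ 'EM SPACE', "\x{2003}" ], [ 'LRM', "\x{200E}" ] ) {
    my ($name, $char) = @$case;
    printf "%-8s  \\s: %d  \\p{Format}: %d\n",
        $name, is_space($char), is_format($char);
}
```

EM SPACE should report \s: 1 and \p{Format}: 0, and LRM the reverse.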
The unicode FORMAT category (invisible formatting indicators) does encompass the LRM character:
Here's a list of the 139 characters in the FORMAT category.

    use strict;
    use warnings;
    use 5.012;

    say hex("200E");   #8206

    my $str = "hello\N{LRM}";

    if ($str =~ /
            hello
            (               #Start of $1
                \p{FORMAT}  #One char in Unicode FORMAT category
            )               #End of $1
        /xms) {             #Standard flags
        say ord($1);        #8206
    }
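For the original goal of trimming the ends of strings, one possible approach is to strip any trailing run of whitespace and FORMAT characters in a single substitution. This is a sketch, not a recommendation for every case: the sub name is made up, and it assumes you really do want every trailing FORMAT character gone (the category also contains things like ZERO WIDTH JOINER and SOFT HYPHEN, which may matter in some text).

```perl
use strict;
use warnings;

# Hypothetical helper: remove trailing whitespace and Format (Cf) characters.
sub strip_trailing_invisibles {
    my ($str) = @_;
    $str =~ s/[\s\p{Format}]+\z//;   # \z anchors at the very end of the string
    return $str;
}

my $raw   = "hello \x{200E}";        # "hello", a space, then LRM
my $clean = strip_trailing_invisibles($raw);

printf "%d -> %d characters\n", length $raw, length $clean;   # 7 -> 5
```

Narrowing the character class to an explicit list such as [\s\x{200E}\x{200F}] is the more conservative option if only the directional marks are a problem.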
Replies are listed 'Best First'.

Re^2: Remove unicode "whitespace"
by HYanWong (Acolyte) on Feb 28, 2013 at 11:20 UTC

Re^2: Remove unicode "whitespace"
by Ratazong (Monsignor) on Feb 28, 2013 at 07:49 UTC
by daxim (Curate) on Feb 28, 2013 at 09:57 UTC