Re: Remove unicode "whitespace"

I have a number of strings which terminate in the unicode character E2 80 8E.

There's no such thing as Unicode character E2 80 8E. Those three bytes are a UTF-8 encoding, i.e. a byte-level representation, of a Unicode *code point* (an integer that identifies some character in some script). The code point here is U+200E, which is the character LEFT-TO-RIGHT MARK (LRM).
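To make the bytes-versus-code-point distinction concrete, here is a minimal sketch that decodes the three bytes with the core Encode module. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three bytes from the question, written as a byte string.
my $bytes = "\xE2\x80\x8E";

# Decode the UTF-8 bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

printf "%d bytes decode to %d character\n", length($bytes), length($chars);
printf "code point: U+%04X\n", ord($chars);    # U+200E
```

If your strings arrive as raw bytes (from a file or socket opened without an encoding layer), they need decoding like this before any character-level regex will see a single LRM rather than three separate bytes.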

this conforms to my expectation of "whitespace", but of course, it doesn't match \s in a RE.

Unicode does not include LRM in the 26 characters it considers whitespace, so that is the final word on what will match \s. Your challenge is going to be to elucidate the category of characters that you want to strip off the end of your strings.
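A quick way to check this for yourself, as a small sketch: test LRM against \s and compare it with a character that Unicode does class as whitespace (EM SPACE, U+2003, chosen here only as an example).

```perl
use strict;
use warnings;

my $lrm      = "\x{200E}";    # LEFT-TO-RIGHT MARK
my $em_space = "\x{2003}";    # EM SPACE, a genuine Unicode whitespace character

# Record whether each character matches \s.
my $lrm_matches_s      = ($lrm      =~ /\s/) ? 1 : 0;
my $em_space_matches_s = ($em_space =~ /\s/) ? 1 : 0;

print "LRM matches \\s:      $lrm_matches_s\n";         # 0
print "EM SPACE matches \\s: $em_space_matches_s\n";    # 1
```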

The Unicode Format general category (Cf, invisible formatting indicators) does include the LRM character:

use strict;
use warnings;
use 5.012;

say hex("200E");    #8206

my $str = "hello\N{LRM}";

if ($str =~ /
     hello 
     (                 #Start of $1
         \p{FORMAT}    #One char in Unicode FORMAT category
     )                 #End of $1

     /xms) {           #Standard flags 

    say ord($1);     #8206
}

Here's a list of the 139 characters in the FORMAT category.
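If "trailing whitespace plus Format characters" turns out to be the category you want, stripping it could look something like the sketch below. The helper name strip_trailing_invisibles is made up for this example, and note that the Format category also contains characters such as ZERO WIDTH JOINER that can be meaningful in some scripts, so check that removing them is acceptable for your data.

```perl
use strict;
use warnings;

# Hypothetical helper: remove trailing whitespace and Format (Cf) characters.
sub strip_trailing_invisibles {
    my ($text) = @_;
    # One or more whitespace/Format characters anchored at the very end.
    $text =~ s/[\s\p{Format}]+\z//;
    return $text;
}

# A trailing space, an LRM and a newline are all removed.
my $cleaned = strip_trailing_invisibles("hello \x{200E}\n");
printf "[%s] length %d\n", $cleaned, length $cleaned;    # [hello] length 5
```

The \z anchor matches only at the absolute end of the string, and the + quantifier means the substitution does nothing when there is nothing to strip.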

Re^2: Remove unicode "whitespace" by HYanWong (Acolyte) on Feb 28, 2013 at 11:20 UTC
Great. Thanks for the tip about `\p{FORMAT}`, and the correction about Unicode terminology. I'll try stripping my strings using `s/[\s\p{FORMAT}]*$//g` then.
Re^2: Remove unicode "whitespace" by Ratazong (Monsignor) on Feb 28, 2013 at 07:49 UTC
"Unicode does not include LRM in the 26 characters it considers whitespace"

Just for completeness: the list of what is considered whitespace can be found here (sub-section white space).
Re^3: Remove unicode "whitespace" by daxim (Curate) on Feb 28, 2013 at 09:57 UTC
That link is out of date. The always-current version is at http://p3rl.org/recharclass.