Remove unicode "whitespace"

HYanWong has asked for the wisdom of the Perl Monks concerning the following question:

Re: Remove unicode "whitespace" by 7stud (Deacon) on Feb 28, 2013 at 03:36 UTC
> I have a number of strings which terminate in the unicode character E2 80 8E.

There's no such thing as unicode character E2 80 8E. That is a UTF-8 encoding, i.e. a secret code, for some unicode integer (where the unicode integer represents some character in some language). The unicode integer is actually U+200E, which represents the character LRM.

> this conforms to my expectation of "whitespace", but of course, it doesn't match \s in a RE.

Unicode does not include LRM in the 26 characters it considers whitespace, so that is the final word on what will match \s. Your challenge is going to be to elucidate the category of characters that you want to strip off the end of your strings. The unicode FORMAT category (invisible formatting indicators) does encompass the LRM character:

```perl
use strict;
use warnings;
use 5.012;

say hex("200E");    #8206

my $str = "hello\N{LRM}";

if ($str =~ /
    hello
    (               #Start of $1
      \p{FORMAT}    #One char in Unicode FORMAT category
    )               #End of $1
    /xms) {         #Standard flags
    say ord($1);    #8206
}
```

Here's a list of the 139 characters in the FORMAT category.
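The difference between the bytes E2 80 8E and the code point U+200E can be checked directly. The following is only a minimal sketch, assuming the core Encode module is available; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three bytes from the question, as a raw byte string.
my $bytes = "\xE2\x80\x8E";

# Decode the UTF-8 bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

my $length    = length $chars;                 # counts characters, not bytes
my $codepoint = sprintf 'U+%04X', ord $chars;  # code point of the first character

print "$length character: $codepoint\n";
```

Three bytes go in and one character comes out, which is why a regex on the decoded string has to look for U+200E rather than for E2 80 8E.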
Re^2: Remove unicode "whitespace" by HYanWong (Acolyte) on Feb 28, 2013 at 11:20 UTC
Great. Thanks for the tip about `\p{FORMAT}`, and the correction about Unicode terminology. I'll try stripping my strings using `s/[\s\p{FORMAT}]*$//g` then.
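One way the stripping idea could be wrapped up, as a sketch rather than a definitive answer. The helper name `strip_trailing_invisible` is hypothetical, and the input is assumed to be an already decoded character string.

```perl
use strict;
use warnings;

# Hypothetical helper: remove trailing whitespace and Format (Cf) characters.
sub strip_trailing_invisible {
    my ($str) = @_;
    $str =~ s/[\s\p{Format}]+\z//;   # only the run at the very end is removed
    return $str;
}

# Trailing space, LRM and newline are all dropped.
my $with_lrm = strip_trailing_invisible("hello \x{200E}\n");

# An LRM in the middle of the string is left alone.
my $inner_lrm = strip_trailing_invisible("a\x{200E}b\x{200E}");
```

Anchoring with `\z` instead of `$` also removes a final newline, and because the match is anchored at the end of the string a single substitution is enough, so `/g` is not needed.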
Re^2: Remove unicode "whitespace" by Ratazong (Monsignor) on Feb 28, 2013 at 07:49 UTC
> Unicode does not include LRM in the 26 characters it considers whitespace

Just for completeness: the list of what is considered whitespace can be found here (sub-section white space).
Re^3: Remove unicode "whitespace" by daxim (Curate) on Feb 28, 2013 at 09:57 UTC
That link is out of date. The always current version is at http://p3rl.org/recharclass.
Re: Remove unicode "whitespace" by Khen1950fx (Canon) on Feb 28, 2013 at 05:23 UTC
Are you sure that LRM is just "whitespace"? I did some googling, and I'm getting a different take on it. As I understand it, LRM is a bidirectional, zero-width character that is necessary for determining the text direction of mixed data, using the Bidi algorithm. If that's true (I could be wrong), then you don't want to trim the LRMs from the links. ikegami could probably explain it better :-).
Re^2: Remove unicode "whitespace" by HYanWong (Acolyte) on Feb 28, 2013 at 11:16 UTC
You're right that it is the LRM character, and so shouldn't be stripped in general (so it's sensible that it doesn't match \s). But it is useless at the end of a string, hence my suggestion that it should be considered something like whitespace in that context. I hoped there might be a function to trim the end of strings for this specific purpose. Or if not, something generic I could add to a RE to strip unicode characters of this nature.
Re^3: Remove unicode "whitespace" by Khen1950fx (Canon) on Feb 28, 2013 at 16:11 UTC
Give URI::Encode a try.

```perl
#!/usr/bin/perl -l
use strict;
use warnings;
use URI::Encode qw(uri_decode);

my $encoded = 'http://commons.wikimedia.org/wiki/File:Atelerix_algirus.jpg%E2%80%8E';
print uri_decode($encoded);
```
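Decoding the percent-escapes does not by itself remove the LRM; it only turns `%E2%80%8E` back into the character. A sketch of the remaining steps using core Perl only, where the hand-rolled percent-decoding is an assumption standing in for a URI module and is not that module's implementation:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $encoded = 'http://commons.wikimedia.org/wiki/File:Atelerix_algirus.jpg%E2%80%8E';

# Percent-decode by hand into raw bytes (illustrative stand-in for a URI module).
(my $bytes = $encoded) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;

# Interpret the bytes as UTF-8, so E2 80 8E becomes the single character U+200E.
my $url = decode('UTF-8', $bytes);

# Drop trailing whitespace and Format characters such as LRM.
$url =~ s/[\s\p{Format}]+\z//;

print "$url\n";
```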
Re^4: Remove unicode "whitespace" by HYanWong (Acolyte) on Mar 01, 2013 at 01:44 UTC
Re: Remove unicode "whitespace" by ikegami (Patriarch) on Mar 01, 2013 at 10:25 UTC
```
$ unichars -au '\s'
 ---- U+00009 CHARACTER TABULATION
 ---- U+0000A LINE FEED (LF)
 ---- U+0000C FORM FEED (FF)
 ---- U+0000D CARRIAGE RETURN (CR)
 ---- U+00020 SPACE
 ---- U+00085 NEXT LINE (NEL)
 ---- U+000A0 NO-BREAK SPACE
 ---- U+01680 OGHAM SPACE MARK
 ---- U+0180E MONGOLIAN VOWEL SEPARATOR
 ---- U+02000 EN QUAD
 ---- U+02001 EM QUAD
 ---- U+02002 EN SPACE
 ---- U+02003 EM SPACE
 ---- U+02004 THREE-PER-EM SPACE
 ---- U+02005 FOUR-PER-EM SPACE
 ---- U+02006 SIX-PER-EM SPACE
 ---- U+02007 FIGURE SPACE
 ---- U+02008 PUNCTUATION SPACE
 ---- U+02009 THIN SPACE
 ---- U+0200A HAIR SPACE
 ---- U+02028 LINE SEPARATOR
 ---- U+02029 PARAGRAPH SEPARATOR
 ---- U+0202F NARROW NO-BREAK SPACE
 ---- U+0205F MEDIUM MATHEMATICAL SPACE
 ---- U+03000 IDEOGRAPHIC SPACE

$ uniprops -a U+200E
U+200E ‹U+200E› \N{LEFT-TO-RIGHT MARK}
    \pC \p{Cf}
    All Any Assigned Bidi_C Bidi_Control BidiC InGeneralPunctuation C Other
    Case_Ignorable CI Cf Format Changes_When_NFKC_Casefolded CWKCF Common Zyyy
    Default_Ignorable_Code_Point DI General_Punctuation Graph Pat_WS
    Pattern_White_Space PatWS Print X_POSIX_Graph X_POSIX_Print
    Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L
    Block=General_Punctuation Canonical_Combining_Class=0
    Canonical_Combining_Class=Not_Reordered CCC=NR
    Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None
    East_Asian_Width=Neutral Grapheme_Cluster_Break=CN
    Grapheme_Cluster_Break=Control GCB=CN Hangul_Syllable_Type=NA
    Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
    JG=NoJoiningGroup Joining_Type=T Joining_Type=Transparent JT=T
    Line_Break=CM Line_Break=Combining_Mark LB=CM Numeric_Type=None NT=None
    Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0
    Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
    Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1
    Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
    Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=FO
    Sentence_Break=Format SB=FO Word_Break=FO Word_Break=Format WB=FO
    _Case_Ignorable
```
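For anyone without the unichars and uniprops tools installed, a few of the properties listed above can be spot-checked from plain Perl. This is only a sketch using the core charnames module and regex property classes; the variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ();

my $lrm = "\x{200E}";

# Look up the official character name for the code point.
my $name = charnames::viacode(0x200E);

# Test a handful of the properties from the uniprops listing.
my $is_space  = $lrm =~ /\s/               ? 1 : 0;
my $is_format = $lrm =~ /\p{Cf}/           ? 1 : 0;
my $is_bidi   = $lrm =~ /\p{Bidi_Control}/ ? 1 : 0;

print "$name: \\s=$is_space Cf=$is_format Bidi_Control=$is_bidi\n";
```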