Re: Remove u200b unicode From String

Last time I converted mdict to StarDict XML, I had to dump the HTML with w3m before converting the XML to StarDict ifo, and the \u200b characters left over afterwards became a problem. I found that with 'use utf8::all', \s will match \u200b, and that no \p{..} could match it without 'use utf8::all'.


Replies are listed 'Best First'.

Re^2: Remove u200b unicode From String
by hippo (Archbishop) on Aug 05, 2024 at 08:57 UTC

no \p{..} could match it without 'use utf8::all'.

This is exactly what \p{Lb=ZW} matches, with no need for utf8::all:

use strict;
use warnings;

use Test::More tests => 5;

my $spacestr = "a\N{ZERO WIDTH SPACE}b";

isnt $spacestr, 'ab', 'String differs from plain "ab"';
is length ($spacestr), 3, 'Length is 3';
is substr ($spacestr, 1, 1), "\x{200B}", 'Middle char is U+200B';

like $spacestr, qr/\p{Lb=ZW}/, '\p{Lb=ZW} matches U+200B';

$spacestr =~ s/\p{Lb=ZW}//;
is $spacestr, 'ab', 'After stripping, string is plain "ab"';

This is not to say that this is a good solution to phildeman's problem, of course. Far better just to get the decoding/encoding correct in the first place.
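
As a rough illustration of what getting the decoding/encoding right might look like, here is a minimal sketch. It assumes the input arrives as raw UTF-8 bytes (for example, from a w3m dump); the helper name strip_zwsp_bytes and the sample data are hypothetical and not taken from the thread. It decodes the bytes to characters with the core Encode module, removes every U+200B with /g, and encodes again on the way out:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the thread): takes raw UTF-8 bytes and
# returns UTF-8 bytes with every ZERO WIDTH SPACE (U+200B) removed.
sub strip_zwsp_bytes {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # bytes -> characters
    $text =~ s/\p{Lb=ZW}//g;              # /g removes every occurrence
    return encode('UTF-8', $text);        # characters -> bytes
}

# Assumed sample input: "a", U+200B, "b", U+200B, "c" as UTF-8 bytes.
my $raw   = "a\xE2\x80\x8Bb\xE2\x80\x8Bc";
my $clean = strip_zwsp_bytes($raw);

# Prints "9 bytes in, 3 bytes out"
print length($raw), " bytes in, ", length($clean), " bytes out\n";
```

Before decoding, $raw holds nine separate bytes and none of them is U+200B, so \p{Lb=ZW} cannot match it; after decoding, the same data is five characters and the property matches as expected.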

🦛
