in reply to Remove u200b unicode From String

The last time I converted an mdict dictionary to StarDict XML, I had to dump the HTML with w3m before converting the XML to StarDict ifo. After that step, the leftover \u200b characters became a problem. I found that with 'use utf8::all', \s matches \u200b, but no \p{..} property could match it unless I set 'use utf8::all'.

Re^2: Remove u200b unicode From String
by hippo (Archbishop) on Aug 05, 2024 at 08:57 UTC
    no \p{..} property could match it unless I set 'use utf8::all'.

    This is exactly what \p{Lb=ZW} matches, with no need for utf8::all:

    use strict;
    use warnings;
    use Test::More tests => 5;

    my $spacestr = "a\N{ZERO WIDTH SPACE}b";
    isnt $spacestr, 'ab', 'String differs from plain "ab"';
    is length ($spacestr), 3, 'Length is 3';
    is substr ($spacestr, 1, 1), "\x{200B}", 'Middle char is U+200B';
    like $spacestr, qr/\p{Lb=ZW}/, '\p{Lb=ZW} matches U+200B';
    $spacestr =~ s/\p{Lb=ZW}//;
    is $spacestr, 'ab', 'After stripping, string is plain "ab"';

    This is not to say that this is a good solution to phildeman's problem, of course. Far better just to get the decoding/encoding correct in the first place.
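
    As a rough illustration of what getting the decoding right might look like, here is a minimal sketch. It assumes the input arrives as UTF-8 encoded bytes (the hard-coded $bytes value is a hypothetical stand-in for whatever the w3m dump produced) and uses the core Encode module. Once the bytes are decoded into characters, U+200B is a single character and an ordinary substitution removes it.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Hypothetical input: raw UTF-8 bytes for "a", U+200B, "b".
    my $bytes = "a\xE2\x80\x8Bb";

    # Decode bytes into characters, so U+200B is one character, not three bytes.
    my $text = decode('UTF-8', $bytes);

    # Remove every ZERO WIDTH SPACE (note /g, so all occurrences go).
    (my $clean = $text) =~ s/\x{200B}//g;

    # Encode back to bytes before writing the result out.
    my $out = encode('UTF-8', $clean);

    printf "%d bytes in, %d chars decoded, %d bytes out\n",
        length($bytes), length($text), length($out);
    ```

    With that input this prints "5 bytes in, 3 chars decoded, 2 bytes out". When reading from a file, opening the handle with an ':encoding(UTF-8)' layer should achieve the same decoding without the explicit decode/encode calls.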


    🦛