phildeman has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I hope you can help me with a dilemma that I have encountered.

I am rendering data in an HTML page. The data is coming from my database (mysql). The database gets updated on a nightly
basis from an external source.

When rendering the title of a book, a few, not many, show a '?'. When I check the database it does not show the '?'.
I copied and pasted the title into vi, which showed <200b>.

So I inserted a regex substitution as I looped through the list of books. It did not remove it.

This is what I see in HTML, Sustainable B?usiness?
This is what I see in the database, Sustainable Business

I am using Class::DBIx. In the connection package I include 'use utf8;'. In the HTML page, I use <meta charset="UTF-8">.
Yet, it still shows the '?'. I have also tried several substitution Perl regexes. Here are a few:

$booktitle =~ s/\u200b//g;
$booktitle =~ s/\x200b//g;
$booktitle =~ s/\<200b\>//g;

Has anyone encountered this issue with a hidden unicode character? If yes, how did you remove the unicode character?

Thanks

Replies are listed 'Best First'.
Re: Remove u200b unicode From String
by hippo (Archbishop) on Jul 24, 2024 at 22:07 UTC

    U+200B is the Unicode ZERO WIDTH SPACE character, which is why it is essentially invisible in the database. All you need to do is decode the data when you extract it from the database and encode it when outputting it as HTML; then this problem (and those caused by any other non-ASCII characters in the data source) just vanishes.

    I am using Class::DBIx

    That may or may not be problematic but I would expect not. Check your database connection parameters as you may be able to specify automatic decoding on the fly. DBD::mysql has the mysql_enable_utf8mb4 option for this.
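    A minimal sketch of the decode-on-input, encode-on-output idea, assuming the driver hands back raw UTF-8 octets (that is, automatic decoding such as mysql_enable_utf8mb4 is not enabled). The helper name clean_title_octets and the sample title are made up for illustration; with driver-side decoding turned on, the decode step would be dropped.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes raw UTF-8 octets and returns cleaned
# UTF-8 octets that are ready to be written into the HTML page.
sub clean_title_octets {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);   # octets -> characters
    $chars =~ s/\x{200B}//g;                # strip ZERO WIDTH SPACE
    return encode('UTF-8', $chars);         # characters -> octets
}

# Sample data: the title with U+200B stored as the octets E2 80 8B.
my $zwsp_octets = "\xE2\x80\x8B";
my $raw = "Sustainable B" . $zwsp_octets . "usiness" . $zwsp_octets;

print clean_title_octets($raw), "\n";
```

    Keeping the substitution between the decode and the encode is what lets the pattern see U+200B as a single character.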


    🦛

Re: Remove u200b unicode From String
by hv (Prior) on Jul 24, 2024 at 23:30 UTC

    See previous comments about encoding; but \x in a substitution takes at most two hex digits unless you wrap the hex in braces: s/\x{200b}//g. Your second example is thus equivalent to s/\x{20}0b//g. This is of course for reasons of backward compatibility with a pre-Unicode age.

    \u is used for upper-casing a single character (the digit 2 in your first example).
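    The escape parsing described above can be checked with a short script. This is only an illustrative sketch; the variable names are arbitrary, and it uses double-quoted strings, where \x and \u are read the same way as in a pattern.

```perl
use strict;
use warnings;

my $two_digit = "\x200b";     # read as \x20 (a space) followed by "0b"
my $braced    = "\x{200b}";   # a single character, U+200B
my $upper     = "\u200b";     # \u upper-cases "2", which stays "2"

# Print the length and the hex ordinals of each string.
for my $s ($two_digit, $braced, $upper) {
    printf "%d character(s): %vX\n", length($s), $s;
}
```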

      In addition, I believe perl must consider the string to be unicode characters before it will be able to match a unicode character from the pattern. Perl will only consider the string to be unicode if the DBD::mysql driver was told to use utf8.

      Thanks. Unfortunately, using s/\x{200b}//g or s/\x{20}-0b//g did not work. I still get the Sustainable B?usiness?.

      -Phil-
Re: Remove u200b unicode From String
by Corion (Patriarch) on Jul 25, 2024 at 07:28 UTC

    I think your problem is that it is unclear which encodings your strings have in

    • the database
    • the driver handing the query results to Perl
    • your code
    • the HTML you output

    In the end, everything is octets, but Perl regular expressions treat a string only as Unicode if it has been properly decoded.

    The main goal to achieve is consistency, and the ideal goal is to Encode::decode the data when you read it (from a file, from the database, ...) and Encode::encode it to UTF-8 when you write it to HTML.

    On the way there, you should inspect the octets of the string, for example using Data::Dumper or Data::Dump to see what octets are in the string and also what Perl thinks the string contains. Ideally, Perl should report it sees \x{200b} in the string. If it reports the three bytes \xE2\x80\x8B you have the right data, but Perl does not know that the string should be seen as Unicode. You then should decode it from UTF-8.

    You should do this inspection for every step of the pipeline.
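    As a rough illustration of the difference between the two cases, the following sketch compares an undecoded string with its decoded form. The sample data is invented; in the real pipeline the octets would come from the database driver.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Undecoded: U+200B is present as the three octets E2 80 8B.
my $octets = "B" . "\xE2\x80\x8B" . "usiness";

# Decoded: the three octets become one character.
my $chars = decode('UTF-8', $octets);

for my $s ($octets, $chars) {
    my $found = ($s =~ /\x{200B}/) ? 'yes' : 'no';
    printf "length %d, pattern matches U+200B: %s\n", length($s), $found;
}
```

    Only the decoded string is shorter by two and matches the pattern, which is why the substitution has no effect on data that was never decoded.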

Re: Remove u200b unicode From String
by ikegami (Patriarch) on Jul 25, 2024 at 02:58 UTC

    If the string really does contain the character 0x200B, you can use any of the following:

    s/\x{200B}//g
    s/\N{U+200B}//g
    s/\N{ZERO WIDTH SPACE}//g

    The last may require use charnames qw( :full );.
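    A small sketch showing that the three forms behave the same on a string that really contains the character. The sample title is made up, and charnames is loaded explicitly in case the perl in use needs it for the named form.

```perl
use strict;
use warnings;
use charnames ':full';   # may be unnecessary on newer perls

my $title = "Sustainable B\x{200B}usiness\x{200B}";

my @patterns = (
    qr/\x{200B}/,
    qr/\N{U+200B}/,
    qr/\N{ZERO WIDTH SPACE}/,
);

# Apply each pattern to a fresh copy of the title.
my @results;
for my $re (@patterns) {
    (my $copy = $title) =~ s/$re//g;
    push @results, $copy;
    print "$copy\n";
}
```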

      Thanks for your suggestions. Unfortunately, none of the suggestions worked. I still get the Sustainable B?usiness?.

      -Phil-
        Did you see my comment about the string needing to be recognized as Unicode by perl? If perl is seeing utf8 bytes, it can't match a unicode character.

        Try printing this:

        use B; use feature 'say'; say B::perlstring($myvalue);

        So you don't have character 0x200B. What do you have? You can use sprintf "%vX", $str for that.
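        A sketch of both inspection techniques side by side, using invented sample strings. The helper name ordinals is hypothetical; it just wraps the sprintf call mentioned above.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: hex ordinals of every character, joined by dots.
sub ordinals {
    my ($str) = @_;
    return sprintf '%vX', $str;
}

my $decoded   = "B" . "\x{200B}" . "u";       # one character U+200B
my $undecoded = "B" . "\xE2\x80\x8B" . "u";   # three separate octets

for my $s ($decoded, $undecoded) {
    # B::perlstring renders the string as an escaped Perl literal.
    print ordinals($s), "  ", B::perlstring($s), "\n";
}
```

        A 200B between the dots means the data is decoded; E2.80.8B means perl is still looking at UTF-8 octets.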

        Just a wild guess, but if your data has line breaks then you might try adding the /m modifier to the substitution.

        s/\x{200B}//gm
Re: Remove u200b unicode From String
by Danny (Chaplain) on Jul 24, 2024 at 20:52 UTC
    If you can print the character to somewhere where you can copy it, e.g. to an xterm, you can just paste it into your regular expression and it should work. For example, using the codepoint 478 which is an A with some dots above:
    perl -we '$chr = "Ǟ"; $s = "abc" . $chr . "xyz"; print "$s\n"; $s =~ s/$chr/ /g; print "$s\n"'
    
    outputs
    abcǞxyz
    abc xyz
    
    Alternatively, you can do something like the following to find characters outside the ascii range:
    use Encode;

    my $s     = get_s_from_somewhere();
    my $chars = decode("UTF-8", $s);
    my %non_ascii;
    for my $i (0..length($chars)-1) {
        if( ord(substr($chars, $i, 1)) > 127 ) {
            $non_ascii{ substr($chars, $i, 1) }++;
        }
    }
    do_something_with_non_ascii(\%non_ascii);
Re: Remove u200b unicode From String
by vincentaxhe (Scribe) on Aug 05, 2024 at 03:52 UTC
    The last time I converted an mdict dictionary to stardict xml, I had to dump the html with w3m before converting the xml to stardict ifo, and the \u200b characters that remained became a problem. I found that with 'use utf8::all', \s will match \u200b, and that no \p{..} could match it unless 'use utf8::all' was set.
      no \p{..} could match it unless 'use utf8::all' was set.

      This is exactly what \p{Lb=ZW} matches, with no need for utf8::all:

      use strict;
      use warnings;
      use Test::More tests => 5;

      my $spacestr = "a\N{ZERO WIDTH SPACE}b";
      isnt $spacestr, 'ab', 'String differs from plain "ab"';
      is length ($spacestr), 3, 'Length is 3';
      is substr ($spacestr, 1, 1), "\x{200B}", 'Middle char is U+200B';
      like $spacestr, qr/\p{Lb=ZW}/, '\p{Lb=ZW} matches U+200B';
      $spacestr =~ s/\p{Lb=ZW}//;
      is $spacestr, 'ab', 'After stripping, string is plain "ab"';

      This is not to say that this is a good solution to phildeman's problem, of course. Far better just to get the decoding/encoding correct in the first place.


      🦛