Remove u200b unicode From String

phildeman has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I hope you can help me with a dilemma that I have encountered.

I am rendering data in an HTML page. The data is coming from my database (mysql). The database gets updated on a nightly
basis from an external source.

When rendering the title of a book, a few, not many, show a '?'. When I check the database it does not show the '?'.
I copied and pasted the title into VI. VI showed <200b>.

So I inserted a regex substitution as I looped through the list of books. It did not remove it.

This is what I see in HTML, Sustainable B?usiness?
This is what I see in the database, Sustainable Business

I am using Class::DBIx. In the connection package I include 'use utf8;'. In the HTML page, I use <meta charset="UTF-8">.
Yet, it still shows the '?'. I have also tried several substitution Perl regexes. Here are a few:

$booktitle =~ s/\u200b//g;
$booktitle =~ s/\x200b//g;
$booktitle =~ s/\<200b\>//g;

Has anyone encountered this issue with a hidden unicode character? If yes, how did you remove the unicode character?

Thanks

Re: Remove u200b unicode From String by hippo (Archbishop) on Jul 24, 2024 at 22:07 UTC
`200b` is the Unicode ZERO WIDTH SPACE character (U+200B), which is why it is essentially invisible in the database. All you need to do is decode the data when you extract it from the database and encode it when outputting it as HTML; then this problem (and those caused by any other utf-8 characters in the data source) just vanishes.
"I am using Class::DBIx"
That may or may not be problematic but I would expect not. Check your database connection parameters as you may be able to specify automatic decoding on the fly. DBD::mysql has the `mysql_enable_utf8mb4` option for this.
🦛
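A minimal sketch of the decode-on-input, encode-on-output approach described above. It assumes the driver hands back raw UTF-8 octets; the sample value and variable names are made up for illustration, and a real script would take the value from the query result instead.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical raw value as it might arrive from the database driver:
# UTF-8 octets, with U+200B encoded as the three bytes E2 80 8B.
my $zwsp_octets = "\xE2\x80\x8B";
my $raw = "Sustainable B" . $zwsp_octets . "usiness" . $zwsp_octets;

# Decode octets into Perl characters on the way in.
my $title = decode('UTF-8', $raw);

# The zero width space is now a single character, so the regex can see it.
(my $clean = $title) =~ s/\x{200B}//g;

# Encode characters back to octets on the way out (e.g. when printing HTML).
my $out = encode('UTF-8', $clean);
print "$out\n";
```

If the driver is configured to decode automatically, the explicit `decode` step would be dropped; decoding twice would corrupt the data.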
Re: Remove u200b unicode From String by hv (Prior) on Jul 24, 2024 at 23:30 UTC
See previous comments about encoding; but `\x` in a substitution takes at most two hex digits unless you wrap the hex in braces: `s/\x{200b}//g`. Your second example is thus equivalent to `s/\x{20}0b//g`. This is of course for reasons of backward-compatibility with a pre-Unicode age. `\u` is used for upper-casing a single character (the digit 2 in your first example).
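The parsing rules above can be checked with a short script. This is only an illustrative sketch; the sample string `size 0b` is invented to show what the unbraced pattern actually matches.

```perl
use strict;
use warnings;

# Without braces, \x consumes at most two hex digits: "\x200b" is
# chr(0x20) (a space) followed by the literal characters "0b".
my $unbraced = "\x200b";
my $braced   = "\x{200b}";

printf "unbraced length: %d\n", length $unbraced;   # three characters
printf "braced length:   %d\n", length $braced;     # one character

# \u upper-cases the next character; a digit has no upper case,
# so "\u200b" is just the four characters "200b".
my $upper = "\u200b";

# The unbraced pattern therefore removes the text " 0b" and leaves
# the real U+200B untouched.
my $text = "size 0b" . $braced;
(my $after = $text) =~ s/\x200b//g;
printf "length after unbraced substitution: %d\n", length $after;
```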
Re^2: Remove u200b unicode From String by NERDVANA (Priest) on Jul 25, 2024 at 01:14 UTC
In addition, I believe perl must consider the string to be unicode characters before it will be able to match a unicode character from the pattern. Perl will only consider the string to be unicode if the DBD::mysql driver was told to use utf8.
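A small sketch of that point, assuming the undecoded value holds the three UTF-8 octets for U+200B. The strings are invented test data, not the poster's actual rows.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Undecoded octets: "a", then E2 80 8B, then "b" (five bytes in total).
my $bytes = "a" . "\xE2\x80\x8B" . "b";

# The pattern asks for the single character U+200B, which a byte string
# cannot contain, so this does not match.
my $matched_bytes = ($bytes =~ /\x{200B}/) ? 1 : 0;

# After decoding, the same data is three characters and the pattern matches.
my $chars = decode('UTF-8', $bytes);
my $matched_chars = ($chars =~ /\x{200B}/) ? 1 : 0;

print "match on bytes: $matched_bytes, match on characters: $matched_chars\n";
```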
Re^2: Remove u200b unicode From String by phildeman (Scribe) on Jul 25, 2024 at 04:30 UTC
Thanks. Unfortunately, using s/\x{200b}//g or s/\x{20}-0b//g did not work. I still get the Sustainable B?usiness?. -Phil-
Re: Remove u200b unicode From String by Corion (Patriarch) on Jul 25, 2024 at 07:28 UTC
I think your problem is that it is unclear which encodings your strings have in
- the database
- the driver handing the query results to Perl
- your code
- the HTML you output
In the end, everything is octets, but Perl regular expressions treat a string only as Unicode if it has been properly decoded. The main goal to achieve is consistency, and the ideal goal is to `Encode::decode` the data when you read it (from a file, from the database, ...) and `Encode::encode` it to UTF-8 when you write it to HTML.
On the way there, you should inspect the octets of the string, for example using Data::Dumper or Data::Dump to see what octets are in the string and also what Perl thinks the string contains. Ideally, Perl should report it sees `\x{200b}` in the string. If it reports the three bytes `\xE2\x80\x8B` you have the right data, but Perl does not know that the string should be seen as Unicode. You then should `decode` it from `UTF-8`. You should do this inspection for every step of the pipeline.
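One way to do that inspection, sketched with Data::Dumper. The sample data and stage labels are assumptions for illustration; in a real pipeline you would dump the actual value at each step. Note that with `Useqq` set, Data::Dumper may show high bytes as octal escapes rather than `\xNN`.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Data::Dumper;

$Data::Dumper::Useqq = 1;   # escape non-printable and non-ASCII characters
$Data::Dumper::Terse = 1;   # print the value only, without "$VAR1 = "

# Hypothetical value at two stages: as octets, and after decoding.
my $raw     = "B" . "\xE2\x80\x8B" . "usiness";
my $decoded = decode('UTF-8', $raw);

for my $stage ([raw => $raw], [decoded => $decoded]) {
    my ($label, $str) = @$stage;
    # Dumper's output already ends with a newline.
    printf "%-8s length=%d dump=%s", $label, length($str), Dumper($str);
}
```

A length that is two larger than expected per zero width space is a quick hint that the string is still undecoded.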
Re: Remove u200b unicode From String by ikegami (Patriarch) on Jul 25, 2024 at 02:58 UTC
If the string really does contain the character 0x200B, you can use any of the following:
    s/\x{200B}//g
    s/\N{U+200B}//g
    s/\N{ZERO WIDTH SPACE}//g
The last may require `use charnames qw( :full );`.
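A quick check that the three spellings behave the same on a string that really does contain U+200B. The test string is made up, and `use charnames` is loaded explicitly so the named form also works on older perls.

```perl
use strict;
use warnings;
use charnames ':full';

# A character string that genuinely contains U+200B.
my $original = "a\x{200B}b";

my @results;
{ (my $s = $original) =~ s/\x{200B}//g;             push @results, $s; }
{ (my $s = $original) =~ s/\N{U+200B}//g;           push @results, $s; }
{ (my $s = $original) =~ s/\N{ZERO WIDTH SPACE}//g; push @results, $s; }

print join(",", @results), "\n";
```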
Re^2: Remove u200b unicode From String by phildeman (Scribe) on Jul 25, 2024 at 03:50 UTC
Thanks for your suggestions. Unfortunately, none of the suggestions worked. I still get the Sustainable B?usiness?. -Phil-
Re^3: Remove u200b unicode From String by NERDVANA (Priest) on Jul 25, 2024 at 08:06 UTC
Did you see my comment about the string needing to be recognized as Unicode by perl? If perl is seeing utf8 bytes, it can't match a unicode character. Try printing this: `use B; say B::perlstring($myvalue);`
Re^3: Remove u200b unicode From String by ikegami (Patriarch) on Jul 25, 2024 at 13:40 UTC
So you don't have character 0x200B. What do you have? You can use `sprintf "%vX", $str` for that.
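For illustration, here is what `%vX` reports for a decoded string versus the same data left as UTF-8 octets. Both sample strings are invented.

```perl
use strict;
use warnings;

# The same text as characters and as undecoded UTF-8 octets.
my $chars = "a\x{200B}b";
my $bytes = "a" . "\xE2\x80\x8B" . "b";

# %vX prints the ordinal of each element in hex, joined by dots.
my $chars_hex = sprintf "%vX", $chars;   # 61.200B.62
my $bytes_hex = sprintf "%vX", $bytes;   # 61.E2.80.8B.62

print "characters: $chars_hex\n";
print "octets:     $bytes_hex\n";
```

Seeing `E2.80.8B` instead of `200B` would indicate the string has not been decoded yet.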
Re^3: Remove u200b unicode From String by mldvx4 (Hermit) on Jul 25, 2024 at 07:09 UTC
Just a wild guess, but if your data has line breaks then you might try adding the `/m` modifier to the substitution: `s/\x{200B}//gm`
Re: Remove u200b unicode From String by Danny (Chaplain) on Jul 24, 2024 at 20:52 UTC
If you can print the character to somewhere where you can copy it, e.g. to an xterm, you can just paste it into your regular expression and it should work. For example, using the codepoint 478 which is an A with some dots above:
    perl -we '$chr = "Ǟ"; $s = "abc" . $chr . "xyz"; print "$s\n"; $s =~ s/$chr/ /g; print "$s\n"'
outputs
    abcǞxyz
    abc xyz
Alternatively, you can do something like the following to find characters outside the ascii range:
    use Encode;
    my $s = get_s_from_somewhere();
    my $chars = decode("UTF-8", $s);
    my %non_ascii;
    for my $i (0..length($chars)-1) {
        if( ord(substr($chars, $i, 1)) > 127 ) {
            $non_ascii{ substr($chars, $i, 1) }++;
        }
    }
    do_something_with_non_ascii(\%non_ascii)
Re: Remove u200b unicode From String by vincentaxhe (Scribe) on Aug 05, 2024 at 03:52 UTC
Last time I converted an mdict dictionary to StarDict XML; before converting the XML to StarDict ifo, I needed to dump the HTML with w3m, and the remaining \u200b characters became a problem. I found that with `use utf8::all`, \s will match \u200b, and that no \p{..} could match it without `use utf8::all`.
Re^2: Remove u200b unicode From String by hippo (Archbishop) on Aug 05, 2024 at 08:57 UTC
"no \p{..} could match it without `use utf8::all`"
This is exactly what `\p{Lb=ZW}` matches with no need for utf8::all:
    use strict;
    use warnings;
    use Test::More tests => 5;
    my $spacestr = "a\N{ZERO WIDTH SPACE}b";
    isnt $spacestr, 'ab', 'String differs from plain "ab"';
    is length ($spacestr), 3, 'Length is 3';
    is substr ($spacestr, 1, 1), "\x{200B}", 'Middle char is U+200B';
    like $spacestr, qr/\p{Lb=ZW}/, '\p{Lb=ZW} matches U+200B';
    $spacestr =~ s/\p{Lb=ZW}//;
    is $spacestr, 'ab', 'After stripping, string is plain "ab"';
This is not to say that this is a good solution to phildeman's problem, of course. Far better just to get the decoding/encoding correct in the first place.
🦛