frevo has asked for the wisdom of the Perl Monks concerning the following question:
I need to scan source code in a 7-bit ASCII file and convert hex encodings of Unicode code points into UTF-8 characters. The output should be a correctly encoded UTF-8 string if it contains any code points > 127. The following snippet does not work: strings 1 and 4 are OK and string 3 is converted correctly, but string 2 is not. Code points in the range 128..255 are not converted correctly unless the same string also contains a code point > 255.
Try it out. The "Dump" statement shows that String 2 is not UTF-8 and the hex characters have not been encoded as UTF-8. You may want to view STDOUT in a UTF-8-aware viewer. PerlMonks' filtering makes it look funny if I include it here.
I know Perl handles 128..255 a little differently, but there must be some workaround.
I have tried variations on Encode::decode and utf8::upgrade, to no avail. Any suggestions on how to convert the 128..255 characters?
    use Devel::Peek;

    my @strings = (
        'Panic Button',                                # String 1
        'Bot\U00F3n de P\U00E1nico',                   # String 2
        'Bot\U00F3n de P\U00E1nico\U200B',             # String 3
        '\U041a\U043d\U043e\U043f\U043a\U0430'
            . ' \U043f\U0430\U043d\U0438\U043a\U0438', # String 4
    );

    for my $string (@strings) {
        print STDERR qq(\n$string\n);
        $string =~ s~ \\U ( [0-9a-fA-F]{4,4} ) ~ chr(hex "0x$1"); ~gex;
        print "$string\n";
        Dump $string;
    }
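For readers hitting the same symptom, here is a minimal sketch of one common workaround. It is not taken from the replies in this thread. The idea is that `chr` returns characters, not UTF-8 bytes: a string whose characters are all below 256 is printed as single bytes, while a string that also holds a character above 255 is written out in Perl's internal UTF-8 form, which is probably why string 3 looks right and string 2 does not. Encoding explicitly with `Encode::encode` removes that dependence. The helper name `escapes_to_utf8` is hypothetical, and the sketch assumes STDOUT has no encoding layer (no `-C` switch, no `binmode`).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from the original post): replace each \UXXXX
# escape with the corresponding character, then encode the whole string
# as UTF-8 bytes, regardless of which code points it contains.
sub escapes_to_utf8 {
    my ($string) = @_;

    # chr() yields a character; nothing is encoded at this point.
    $string =~ s/\\U([0-9a-fA-F]{4})/chr(hex($1))/ge;

    # Explicitly turn the characters into UTF-8 encoded bytes.
    return encode('UTF-8', $string);
}

# String 2 from the question; single quotes keep \U as literal text.
my $bytes = escapes_to_utf8('Bot\U00F3n de P\U00E1nico');
print "$bytes\n";
```

An alternative with the same effect on output would be to leave the string as characters and set `binmode STDOUT, ':encoding(UTF-8)'` once, so that every `print` is encoded on the way out. Which of the two fits better depends on whether the rest of the program wants characters or bytes.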
Re: Convert hex to UTF-8
by ikegami (Patriarch) on Sep 27, 2008 at 04:05 UTC

Re: Convert hex to UTF-8
by massa (Hermit) on Sep 27, 2008 at 02:17 UTC

Re: Convert hex to UTF-8
by JavaFan (Canon) on Sep 27, 2008 at 02:11 UTC

Re: Convert hex to UTF-8
by Anonymous Monk on Feb 10, 2014 at 10:15 UTC