I need to scan source code from a 7-bit ASCII file and convert hex encodings of Unicode code points into UTF-8 characters. The output should be a correctly encoded UTF-8 string if it contains any code points > 127. The following snippet does not work: strings 1 and 4 are OK, string 2 is not converted correctly, yet string 3 is. In other words, code points in the range 128..255 are not converted correctly unless there is a code point > 255 in the same string.

Try it out. The "Dump" statement shows that String 2 is not UTF-8 and the hex characters have not been encoded as UTF-8. You may want to view STDOUT in a UTF-8-aware viewer. PerlMonks' filtering makes it look funny if I include it here.

I know Perl handles 128..255 a little differently, but there must be some workaround.

I have tried variations on Encode::decode and utf8::upgrade to no avail. Any suggestions how to convert the 128..255 characters?

use Devel::Peek;
my @strings = (
    'Panic Button',                                # String 1
    'Bot\U00F3n de P\U00E1nico',                    # String 2
    'Bot\U00F3n de P\U00E1nico\U200B',                # String 3
    '\U041a\U043d\U043e\U043f\U043a\U0430' . 
      ' \U043f\U0430\U043d\U0438\U043a\U0438',        # String 4
    );
for my $string (@strings){
    print STDERR qq(\n$string\n);
    $string =~ s~  \\U ( [0-9a-fA-F]{4,4} ) 
            ~ 
                chr(hex "0x$1");
            ~gex;
    print "$string\n";
    Dump $string;
}
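A possible explanation, offered as a sketch rather than a confirmed diagnosis: a string whose characters are all below 256 can stay in Perl's native 8-bit form, and print on a handle with no encoding layer writes each such character as a single byte. Once a character above 255 is present, the string is held internally as UTF-8, and those bytes are what reach the handle. Encoding explicitly before printing removes that difference. The helper name hex_escapes_to_utf8 below is hypothetical and not part of the original snippet.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from the original post): turn each \UXXXX
# escape into the corresponding character, then encode the result to
# UTF-8 bytes so the output no longer depends on Perl's internal form.
sub hex_escapes_to_utf8 {
    my ($text) = @_;
    (my $chars = $text) =~ s/\\U([0-9a-fA-F]{4})/chr(hex($1))/ge;
    return encode('UTF-8', $chars);    # byte string, safe for a raw handle
}

print hex_escapes_to_utf8('Bot\U00F3n de P\U00E1nico'), "\n";
```

An alternative under the same assumption is to leave the substitution as it is and call binmode(STDOUT, ':encoding(UTF-8)') once before printing, so the output layer does the encoding for every string.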

In reply to Convert hex to UTF-8 by frevo
