Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

Perl seems unable to properly handle the slash-x style octets that represent characters in the upper-ascii range or beyond--or perhaps, as is likely, I just don't know what I'm doing! With that in mind, someone can probably spot my error. Here's the issue.

I need to convert files which have stored Latin1-style characters in a (to me) non-standard UTF-8 format to "standard" UTF-8 characters (the actual characters instead of representations of them).

THIS:

\xc3\xa4 <span class="sy">\xc3\xa4</span>, <span class="sy">\xc3\x84</span><span class="posg pos">Substantiv, Neutrum, das</span><span class="vg v"> \xc3\x84 \xc9\x9b\xcb\x90 das \xc3\xa4; Genitiv: des \xc3\xa4 (umgangssprachlich: -s), \xc3\xa4 (umgangssprachlich: -s) </span>

Should be converted to THIS:

ä <span class="sy">ä</span>, <span class="sy">Ä</span><span class="posg pos">Substantiv, Neutrum, das</span><span class="vg v"> Ä ɛː das ä; Genitiv: des ä (umgangssprachlich: -s), ä (umgangssprachlich: -s) </span>

I've tried everything I can think of. I am able to correctly convert it in a browser by using the following sequence:

use Encode;
use utf8;
binmode STDOUT, ':utf8';
print "Content-Type:text/html; charset=utf-8\n";
print "Content-Language: utf8;\n\n";
my $ucode = qq|\xc3\xa4 <span class="sy">\xc3\xa4</span>, <span class="sy">\xc3\x84</span><span class="posg pos">Substantiv, Neutrum, das</span><span class="vg v"> \xc3\x84 \xc9\x9b\xcb\x90 das \xc3\xa4; Genitiv: des \xc3\xa4 (umgangssprachlich: -s), \xc3\xa4 (umgangssprachlich: -s) </span>|;
my $newcode = decode('utf8', $ucode);
print "<p>$newcode</p>\n";

But if I read the line from a file, perform this conversion, and then push that line into an array and print it to a file, I still end up with the slash-x sequences that I wanted to abolish! This causes me to think that Perl had not worked the magic at all--it all came from my browser. I hate to think that Firefox handles text encodings more powerfully than Perl is capable of!

What must I do?

Blessings,

~Polyglot~

Re: Perl's encoding versus UTF8 octets
by GrandFather (Saint) on Jan 13, 2021 at 05:10 UTC

    When I run your code from Komodo IDE, which understands UTF-8, it prints:

    Content-Type:text/html; charset=utf-8
    Content-Language: utf8;

    <p>ä <span class="sy">ä</span>, <span class="sy">Ä</span><span class="posg pos">Substantiv, Neutrum, das</span><span class="vg v"> Ä ɛː das ä; Genitiv: des ä (umgangssprachlich: -s), ä (umgangssprachlich: -s) </span></p>

    Is that not what you expected to see? Maybe the terminal you are using doesn't understand UTF8?

    Update: note that you don't need use utf8;. That pragma is only required if the source code itself contains UTF-8. Yours doesn't: the string holds UTF-8 encoded bytes written as \x escapes, but the source code is pure 7-bit ASCII.

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

      Yes, my code does send that to the browser, as I noted above. I just haven't found a way to send the converted text to a file. Unfortunately, the file sizes I'm working with are well above 100 MB, and it's simply not practical to try to run it all through the browser, click on "view source," copy it out, and paste it into a text file. In fact, even TextWrangler chokes on these file sizes already. I'm trying to convert them before pumping them into a database.

      Blessings,

      ~Polyglot~

        What is actually stored in your files? The literal text you provided in the string, or something else? If it is the text in the string then you can:

        use strict;
        use warnings;
        use Encode;

        binmode STDOUT, ':utf8';
        print "Content-Type:text/html; charset=utf-8\n";
        print "Content-Language: utf8;\n\n";

        # Slurp the sample text, which holds literal backslash-x sequences
        my $asText = do {local $/; <DATA>};
        # Turn each \xHH sequence into the single byte it names
        $asText =~ s!\\x([0-9A-Fa-f]{2})!chr(hex($1))!ge;
        # Decode the resulting UTF-8 bytes into Perl characters
        my $newcode = decode('utf8', $asText);
        print "<p>$newcode</p>\n";

        __DATA__
        \xc3\xa4 <span class="sy">\xc3\xa4</span>, <span class="sy">\xc3\x84</span>
        <span class="posg pos">Substantiv, Neutrum, das</span>
        <span class="vg v"> \xc3\x84 \xc9\x9b\xcb\x90 das \xc3\xa4; Genitiv: des \xc3\xa4 (umgangssprachlich: -s), \xc3\xa4 (umgangssprachlich: -s) </span>

        Prints:

        Content-Type:text/html; charset=utf-8
        Content-Language: utf8;

        <p>ä <span class="sy">ä</span>, <span class="sy">Ä</span>
        <span class="posg pos">Substantiv, Neutrum, das</span>
        <span class="vg v"> Ä ɛː das ä; Genitiv: des ä (umgangssprachlich: -s), ä (umgangssprachlich: -s) </span>
        </p>
Re: Perl's encoding versus UTF8 octets
by haj (Vicar) on Jan 13, 2021 at 06:53 UTC

    The question is how \xc3\xa4 is actually represented in your file.

    If you write qq|\xc3\xa4| in Perl source, then Perl interprets the eight characters as two bytes with the hexadecimal values c3 and a4, respectively. But if you read \xc3\xa4 from a file, this interpretation doesn't take place: what you get are eight individual ASCII characters. What you can do, of course, is do the interpretation yourself:

    use Encode;
    my $ucode = q/\xc3\xa4/; # note the use of 'q', not 'qq'
    my $newcode = decode('utf8', $ucode =~ s/\\x([a-fA-F0-9]{2})/chr hex($1)/egr);
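    The difference between the two quoting styles can be made visible by comparing string lengths. The following is only a small sketch; the variable names are made up for illustration, and the /r modifier assumes Perl 5.14 or later.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # Single-quote style: backslash, x, c, 3, ... stay as literal characters,
    # which is what a line read from the file looks like.
    my $from_file = q/\xc3\xa4/;
    # Double-quote style: Perl turns the two escapes into two bytes.
    my $in_source = qq/\xc3\xa4/;

    # Undo the escaping by hand, then decode the UTF-8 bytes into characters.
    my $bytes = $from_file =~ s/\\x([0-9A-Fa-f]{2})/chr hex($1)/egr;
    my $char  = decode('UTF-8', $bytes);

    printf "%d %d %d %d\n",
        length($from_file), length($in_source), length($bytes), length($char);
    # prints "8 2 2 1"
    ```

    Eight ASCII characters become two bytes after the substitution, and those two bytes become the single character ä after decoding.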

      Yes, it appears you have the same solution that GrandFather posted, but have managed to crystallize it into a one-liner. Thank you. I'm sure future readers will appreciate this additional clarity.

      NOTE: Yes, the file does have those literal characters, and they were being processed as eight ASCII characters by Perl.

      Blessings,

      ~Polyglot~
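
For the file-to-file case with inputs above 100 MB, the substitution from the replies can be applied one line at a time instead of slurping the whole file. The sketch below is an illustration, not code from the thread: the names unescape_octets and convert_file and the file names in the usage comment are hypothetical. It assumes every \xHH sequence in the input is an escape to be converted and that no sequence is split across a line break. Because the unescaped bytes are already UTF-8, it reads and writes raw bytes and skips decode entirely.

```perl
use strict;
use warnings;

# Turn each literal \xHH escape in a line into the single byte it names.
sub unescape_octets {
    my ($line) = @_;
    $line =~ s/\\x([0-9A-Fa-f]{2})/chr hex($1)/ge;
    return $line;
}

# Stream $in_path to $out_path line by line, so the whole file never has
# to fit in memory at once.
sub convert_file {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:raw', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>:raw', $out_path or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        print {$out} unescape_octets($line);
    }
    close $in;
    close $out or die "Cannot finish $out_path: $!";
}

# Hypothetical usage: perl unescape.pl escaped.txt converted.txt
convert_file(@ARGV) if @ARGV == 2;
```

If the text needs further processing as characters before it is written, decode('UTF-8', ...) can be applied to each unescaped line, with the output handle opened using an encoding layer instead of :raw.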