Perl's encoding versus UTF8 octets

Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

Perl seems unable to properly handle the slash-x style octets that represent characters in the upper-ascii range or beyond--or perhaps, as is likely, I just don't know what I'm doing! With that in mind, someone can probably spot my error. Here's the issue.

I need to convert files which have stored Latin1-style characters in a (to me) non-standard UTF-8 format to "standard" UTF-8 characters (the actual characters instead of representations of them).

THIS:

\xc3\xa4    <span class="sy">\xc3\xa4</span>, <span class="sy">\xc3\x84</span><span class="posg pos">Substantiv, Neutrum, das</span><span class="vg v"> \xc3\x84  \xc9\x9b\xcb\x90 das \xc3\xa4; Genitiv: des \xc3\xa4 (umgangssprachlich: -s), \xc3\xa4 (umgangssprachlich: -s) </span>

Should be converted to THIS:

ä    <span class="sy">ä</span>, <span class="sy">Ä</span><span class="posg pos">Substantiv, Neutrum, das</span><span class="vg v"> Ä  ɛː das ä; Genitiv: des ä (umgangssprachlich: -s), ä (umgangssprachlich: -s) </span>

I've tried everything I can think of. I am able to correctly convert it in a browser by using the following sequence:

use Encode;
use utf8;

binmode STDOUT, ':utf8';

print "Content-Type:text/html; charset=utf-8\n";
print "Content-Language: utf8;\n\n";

my $ucode = qq|\xc3\xa4    <span class="sy">\xc3\xa4</span>, <span class="sy">\xc3\x84</span><span class="posg pos">Substantiv, Neutrum, das</span><span class="vg v"> \xc3\x84  \xc9\x9b\xcb\x90 das \xc3\xa4; Genitiv: des \xc3\xa4 (umgangssprachlich: -s), \xc3\xa4 (umgangssprachlich: -s) </span>|;

my $newcode = decode('utf8', $ucode);

print "<p>$newcode</p>\n";

But if I read the line from a file, perform this conversion, and then push that line into an array and print it to a file, I still end up with the slash-x sequences that I wanted to abolish! This makes me think that Perl never worked the magic at all and that it all came from my browser. I hate to think that Firefox handles text encodings more powerfully than Perl is capable of!

What must I do?

Blessings,

~Polyglot~


Re: Perl's encoding versus UTF8 octets by GrandFather (Saint) on Jan 13, 2021 at 05:10 UTC
When I run your code from Komodo IDE, which understands UTF-8, it prints:

Content-Type:text/html; charset=utf-8
Content-Language: utf8;

<p>ä    <span class="sy">ä</span>, <span class="sy">Ä</span><span class="posg pos">Substantiv, Neutrum, das</span><span class="vg v"> Ä  ɛː das ä; Genitiv: des ä (umgangssprachlich: -s), ä (umgangssprachlich: -s) </span></p>

Is that not what you expected to see? Maybe the terminal you are using doesn't understand UTF-8?

Update: note that you don't need `use utf8;`. That is only required if the source code itself contains UTF-8 text. Yours doesn't: the `\x` escapes build a string of UTF-8 encoded bytes, but the source code is pure 7-bit ASCII.

Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
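To illustrate the point about `use utf8;`, here is a minimal sketch. It assumes the script file itself is saved as UTF-8; the variable names are made up for the illustration. The pragma is lexically scoped, so the same literal can be compiled both ways in one file.

```perl
use strict;
use warnings;

my ($as_chars, $as_bytes);

{
    use utf8;            # source text in this block is decoded as UTF-8
    $as_chars = "ä";     # one character, U+00E4
}
{
    no utf8;             # source text in this block is taken as raw bytes
    $as_bytes = "ä";     # the two bytes 0xC3 0xA4 (assuming a UTF-8 source file)
}

# Only lengths are printed, so no output layer is needed here.
print length($as_chars), "\n";
print length($as_bytes), "\n";
```

A string written with `\xc3\xa4` escapes behaves like the second case whether or not the pragma is in effect, which is why the original script works without it.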
Re^2: Perl's encoding versus UTF8 octets by Polyglot (Chaplain) on Jan 13, 2021 at 05:27 UTC
Yes, my code does send that to the browser, as I noted above. I just haven't found a way to send the converted text to a file. Unfortunately, the file sizes I'm working with are well above 100 MB, and it's simply not practical to try to run it all through the browser, click on "view source," copy it out, and paste it into a text file. In fact, even TextWrangler chokes on these file sizes already. I'm trying to convert them before pumping them into a database.

Blessings,

~Polyglot~
Re^3: Perl's encoding versus UTF8 octets by GrandFather (Saint) on Jan 13, 2021 at 05:57 UTC
What is actually stored in your files? The literal text you provided in the string, or something else? If it is the text in the string then you can:

use strict;
use warnings;
use Encode;

binmode STDOUT, ':utf8';

print "Content-Type:text/html; charset=utf-8\n";
print "Content-Language: utf8;\n\n";

# Slurp the sample text, which contains literal backslash-x escapes
my $asText = do {local $/; <DATA>};

# Turn each literal \xHH escape into the single byte it names
$asText =~ s!\\x([0-9a-fA-F]{2})!chr(hex($1))!ge;

# Decode the resulting UTF-8 bytes into Perl characters
my $newcode = decode('utf8', $asText);

print "<p>$newcode</p>\n";

__DATA__
\xc3\xa4    <span class="sy">\xc3\xa4</span>, <span class="sy">\xc3\x84</span>
<span class="posg pos">Substantiv, Neutrum, das</span>
<span class="vg v"> \xc3\x84  \xc9\x9b\xcb\x90 das \xc3\xa4; Genitiv: des \xc3\xa4 (umgangssprachlich: -s), \xc3\xa4 (umgangssprachlich: -s) </span>

Prints:

Content-Type:text/html; charset=utf-8
Content-Language: utf8;

<p>ä    <span class="sy">ä</span>, <span class="sy">Ä</span>
<span class="posg pos">Substantiv, Neutrum, das</span>
<span class="vg v"> Ä  ɛː das ä; Genitiv: des ä (umgangssprachlich: -s), ä (umgangssprachlich: -s) </span>
</p>
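For files of the size mentioned earlier in the thread (well above 100 MB), the same substitution can probably be applied line by line, writing straight to an output file instead of STDOUT. What follows is a minimal sketch, not a tested production script. The helper names `unescape_line` and `convert_file` and the script name in the usage comment are hypothetical, and it assumes the escapes for one character never straddle a line break.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn literal \xHH escapes in one line into bytes,
# then decode those bytes as UTF-8 into Perl characters.
sub unescape_line {
    my ($line) = @_;
    $line =~ s/\\x([0-9a-fA-F]{2})/chr hex $1/ge;
    return decode('UTF-8', $line);
}

# Hypothetical helper: stream $in_path to $out_path one line at a time,
# so the whole file never has to fit in memory.
sub convert_file {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:raw',             $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>:encoding(UTF-8)', $out_path or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        print {$out} unescape_line($line);
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Usage (file names are placeholders): perl convert.pl input.txt output.txt
convert_file(@ARGV) if @ARGV == 2;
```

The input is opened with `:raw` so the escapes become plain bytes, and the output handle carries the `:encoding(UTF-8)` layer so the decoded characters are written back out as real UTF-8. With the default settings, `decode` replaces malformed byte sequences with U+FFFD rather than dying, which may or may not be what you want for a database import.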
Re^4: Perl's encoding versus UTF8 octets by Polyglot (Chaplain) on Jan 13, 2021 at 07:00 UTC
Re: Perl's encoding versus UTF8 octets by haj (Vicar) on Jan 13, 2021 at 06:53 UTC
The question is how `\xc3\xa4` is actually represented in your file. If you write `qq|\xc3\xa4|`, then Perl interprets the eight characters as two bytes with the hexadecimal values of `c3` and `a4`, respectively. But if you read `\xc3\xa4` from a file, this interpretation doesn't take place: These are eight individual ASCII characters. What you can do, of course, is do the interpretation yourself:

use Encode;
my $ucode = q/\xc3\xa4/; # note the use of 'q', not 'qq'
my $newcode = decode('utf8',$ucode =~ s/\\x([a-fA-F0-9]{2})/chr hex($1)/egr);
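The difference between the two quoting forms can be checked by counting. This is a small sketch with made-up variable names: it compares the string lengths before and after the manual interpretation.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $interpolated = qq|\xc3\xa4|;  # escapes processed by Perl: two bytes
my $literal      = q|\xc3\xa4|;   # no processing: eight ASCII characters,
                                  # the same as a line read from the file

printf "interpolated: %d\n", length $interpolated;  # interpolated: 2
printf "literal:      %d\n", length $literal;       # literal:      8

# Do the interpretation by hand, then decode the bytes as UTF-8.
(my $bytes = $literal) =~ s/\\x([0-9a-fA-F]{2})/chr hex $1/ge;
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d, code point: U+%04X\n",
    length $bytes, length $chars, ord $chars;
# bytes: 2, characters: 1, code point: U+00E4
```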
Re^2: Perl's encoding versus UTF8 octets by Polyglot (Chaplain) on Jan 13, 2021 at 07:06 UTC
Yes, it appears you have the same solution that GrandFather posted, but have managed to crystallize it into a one-liner. Thank you. I'm sure future readers will appreciate this additional clarity.

NOTE: Yes, the file does have those literal characters, and they were being processed as eight ASCII characters by Perl.

Blessings,

~Polyglot~