in reply to Re^2: Perl's encoding versus UTF8 octets
in thread Perl's encoding versus UTF8 octets

What is actually stored in your files? The literal text you provided in the string, or something else? If it is the literal text from the string (a backslash, an x, and two hex digits per octet), then you can convert it like this:

use strict;
use warnings;
use Encode;
binmode STDOUT, ':utf8';
print "Content-Type:text/html; charset=utf-8\n";
print "Content-Language: utf8;\n\n";
my $asText = do {local $/; <DATA>};
$asText =~ s!\\x(..)!chr(hex($1))!ge;
my $uCode;
my $newcode = decode('utf8', $asText);
print "<p>$newcode</p>\n";
__DATA__
\xc3\xa4
<span class="sy">\xc3\xa4</span>, <span class="sy">\xc3\x84</span>
<span class="posg pos">Substantiv, Neutrum, das</span>
<span class="vg v">
\xc3\x84
\xc9\x9b\xcb\x90
das \xc3\xa4; Genitiv: des \xc3\xa4 (umgangssprachlich: -s), \xc3\xa4 (umgangssprachlich: -s)
</span>

Prints:

Content-Type:text/html; charset=utf-8
Content-Language: utf8;

<p>ä
<span class="sy">ä</span>, <span class="sy">Ä</span>
<span class="posg pos">Substantiv, Neutrum, das</span>
<span class="vg v">
Ä
ɛː
das ä; Genitiv: des ä (umgangssprachlich: -s), ä (umgangssprachlich: -s)
</span>
</p>
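
To see why both steps are needed, here is a minimal sketch that prints the length of the string at each stage. The sample string and the names $as_text, $octets and $chars are illustrative assumptions, not taken from the real files, and the sketch uses the strict 'UTF-8' encoding name where the code above uses the laxer 'utf8'.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: literal text as it might sit in the file, i.e. a
# backslash, an 'x' and two hex digits per octet (single quotes keep the
# backslashes literal).
my $as_text = '\xc3\xa4 \xc9\x9b';

# Step 1: replace each \xHH escape with the single byte it names.
(my $octets = $as_text) =~ s/\\x([0-9A-Fa-f]{2})/chr(hex($1))/ge;

# Step 2: interpret those bytes as UTF-8 to get a Perl character string.
my $chars = decode('UTF-8', $octets);

binmode STDOUT, ':encoding(UTF-8)';
printf "literal text: %d characters\n", length $as_text;    # 17
printf "after step 1: %d octets\n",     length $octets;     # 5
printf "after step 2: %d characters (%s)\n", length $chars, $chars;   # 3
```

Stopping after step 1 leaves five raw octets that would display as mojibake once an output layer encodes them again; only after step 2 does Perl treat them as the three characters they represent.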
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

Re^4: Perl's encoding versus UTF8 octets
by Polyglot (Chaplain) on Jan 13, 2021 at 07:00 UTC

    My file format isn't exactly like what I gave earlier, but I don't think the difference is significant in this case. I'm doing some tag reductions and reformatting to prepare it for DB insertion, and there was no sense in posting all of the bloat here.

    I had tried something earlier that gave me the same result as your line:

    $asText =~ s!\\x(..)!chr(hex($1))!ge;

    However, combining it with the subsequent decode step did the trick! Evidently the conversion needs two steps, first turning the \xHH escapes into raw octets and then decoding those octets as UTF-8, and all of my attempts had stopped after one (at least within my code's conversion, not counting the encodings set on the files for reading and writing). I'm no stranger to encoding issues, but I hadn't worked with these backslash-x octets before (I don't even know what they're supposed to be called), and they really threw me for a loop.

    So, THANK YOU so much!

    Blessings,

    ~Polyglot~