UTF-8 and Unicode the hard way

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Perl Monks! Long-time listener, first-time caller. Despite nearly thirty years of Perl work, there are a few things I hate so much I actively avoid understanding them, and because I tend to always need fast solutions yesterday, there have been a number of times where I've gone for a brute-force approach that isn't very maintainable in the long term.

I get lumps of long text data that have what I, an Old, think of as "extended ASCII characters" or "special characters." I apparently get them in UTF-8. For data output for web page purposes, I must find any extended characters in the data and attempt to translate them to Unicode. My preference would be to translate them to HTML escaped entities, but the client wants Unicode.

My brute-force approach is, for the various data fields which require translation, I call a function of my own:
$answer = &blunt_decode($answer);

The subroutine blunt_decode() consists literally of lines and lines of regex substitutions. Not more than two Unicode codepages' worth, I don't think, covering ninety percent of the extended characters that show up in names (it's always proper names that are the problem), but it's still a lot of lines. e.g.:

    $s =~ s/\xc4\x80/\x{100}/g;
    $s =~ s/\xc4\x81/\x{101}/g;
    $s =~ s/\xc4\x82/\x{102}/g;
    $s =~ s/\xc4\x83/\x{103}/g;
    $s =~ s/\xc4\x84/\x{104}/g;
    $s =~ s/\xc4\x85/\x{105}/g;
    $s =~ s/\xc4\x86/\x{106}/g;
    $s =~ s/\xc4\x87/\x{107}/g;
    $s =~ s/\xc4\x88/\x{108}/g;
    $s =~ s/\xc4\x89/\x{109}/g;
    $s =~ s/\xc4\x8a/\x{10A}/g;
    $s =~ s/\xc4\x8b/\x{10B}/g;
    $s =~ s/\xc4\x8c/\x{10C}/g;
    $s =~ s/\xc4\x8d/\x{10D}/g;

And so on and so on. You get the idea. I know there has got to be a better way that doesn't require me to essentially maintain my own lookup table, but I don't know what it is and when I start reading about Unicode translation my eyes glaze over and my brain gets fuzzy.

I did look at Encode::Unicode and tried using that, but I must have been doing it wrong, because

    use Encode qw(encode decode);
    $answer = encode("UCS-2BE", $answer);

mangled the data output horribly.

In short, I am looking for some canned, easier way to accept a scalar, translate any extended-character UTF8 into the equivalent unicode sequence (leaving the normal characters alone), and spit the translated scalar back.

Replies are listed 'Best First'.
Re: UTF-8 and Unicode the hard way by Corion (Patriarch) on May 09, 2022 at 16:29 UTC
You didn't show any relevant input data, but from your description, I have to wonder whether you tried the following and whether it worked: `my $answer = decode('UTF-8', $answer);` You decode external input when reading, and you encode strings before writing them as output.
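A minimal sketch of that decode-on-input, encode-on-output flow, checked against one of the hand-written substitutions from the question. The sample bytes and the variable names (`$bytes`, `$text`, `$by_hand`, `$out`) are assumptions for illustration, not data from the original poster.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 bytes for "Ā" (U+0100) followed by ASCII "bc".
my $bytes = "\xc4\x80bc";

# Decode once on input: UTF-8 bytes become a Perl character string.
my $text = decode('UTF-8', $bytes);

# The same conversion done with one line of the hand-maintained table.
my $by_hand = $bytes;
$by_hand =~ s/\xc4\x80/\x{100}/g;

printf "bytes: %d, characters: %d\n", length($bytes), length($text);
print "decode() matches the hand-written substitution\n" if $text eq $by_hand;

# Encode once on output: the character string becomes UTF-8 bytes again.
my $out = encode('UTF-8', $text);
print "round trip gives back the original bytes\n" if $out eq $bytes;
```

Because `decode` handles every valid UTF-8 sequence, it covers the whole lookup table in a single call, and plain ASCII characters pass through unchanged.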
Re^2: UTF-8 and Unicode the hard way by Anonymous Monk on May 09, 2022 at 16:49 UTC
Hmm. Well, that doesn't work either, though. Using `$answer = encode("UCS-2BE", $answer);` results in \u0000 in front of EVERY character in output ... But using `$answer = decode('UTF-8', $answer);` produces a "wide character in output" error.
Re^3: UTF-8 and Unicode the hard way by haj (Vicar) on May 09, 2022 at 20:51 UTC
Corion provided the correct answer, but you failed to verify it. If you print a decoded non-ASCII character, then you get the `wide character` warning. This is exactly what happens when you print the result of your own substitutions:

    $ perl -E "print qq(\x{100})"
    Wide character in print at -e line 1.
    Ā

Printed output needs to be encoded into a byte stream which the receiving side is able to understand. In many cases like contemporary Unix terminals, UTF-8 is a good guess - which is the encoding your `$answer` came from.
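Two common ways to supply that output encoding are sketched below: calling `encode` explicitly, or attaching an `:encoding(UTF-8)` layer to the filehandle. The sample string and the in-memory filehandle are assumptions used only to make the result inspectable; a real program would more likely apply the layer to STDOUT or a file.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical decoded text: "Ā" (U+0100) followed by ASCII "ron".
my $text = "\x{100}ron";

# Option 1: encode explicitly, then print the resulting bytes.
my $bytes = encode('UTF-8', $text);

# Option 2: let an I/O layer encode on the way out.
# An in-memory filehandle stands in for a real file here.
my $buffer = '';
open my $fh, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory handle: $!";
print {$fh} $text;
close $fh;

print "both approaches produce the same bytes\n" if $buffer eq $bytes;

# The same layer on STDOUT avoids the "Wide character in print" warning.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

Pick one approach per handle: encoding explicitly and also using an encoding layer would encode the data twice.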
Re^4: UTF-8 and Unicode the hard way by Anonymous Monk on May 10, 2022 at 17:26 UTC
Re^5: UTF-8 and Unicode the hard way by hippo (Archbishop) on May 10, 2022 at 17:38 UTC
Re^5: UTF-8 and Unicode the hard way by Anonymous Monk on May 10, 2022 at 17:32 UTC
Re^3: UTF-8 and Unicode the hard way by Corion (Patriarch) on May 09, 2022 at 17:54 UTC
You shouldn't get a `wide character in output` error from your call to `decode(...)`. Can you please show the relevant code and data that produces that output?
Re^4: UTF-8 and Unicode the hard way by ikegami (Patriarch) on May 10, 2022 at 19:33 UTC
Re^3: UTF-8 and Unicode the hard way by Anonymous Monk on May 10, 2022 at 07:26 UTC
What do you mean by "unicode", then? By itself, Unicode is a correspondence between characters (e.g. "Ы", CYRILLIC CAPITAL LETTER YERU) and integers (in this case, 0x042B or 1067), except unlike ASCII or other single-byte encodings, the integers aren't limited to the range of [0, 255]. Do you need to transform the UTF-8-encoded text into an array of integers?

In order to represent Unicode code points as bytes, you need a Unicode encoding; UTF-8, UTF-16, UCS-2BE are all examples of those. Perl has a native representation for Unicode code points, which it calls wide characters: for Perl programs, they are like bytes, but have values above 255 and are interpreted according to Unicode rules.

Since files and input/output streams contain bytes, not Unicode wide characters, there needs to be an additional encoding layer between them and Unicode. Which is why it's important to read at least perlunitut before attempting to work with Unicode in Perl.
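To make the code point versus encoding distinction concrete, here is a short sketch that takes the YERU example above and shows its bytes in two encodings. The helper `hex_bytes` is a hypothetical name introduced for illustration. The last line also suggests why encoding to UCS-2BE appeared to put a zero in front of every character: in that encoding each ASCII character occupies two bytes, the first of which is 0x00.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

my $yeru = "\x{42B}";    # CYRILLIC CAPITAL LETTER YERU, a single character

# The code point is just an integer.
printf "code point: U+%04X (%d)\n", ord($yeru), ord($yeru);   # U+042B (1067)

# The same code point as bytes in two different encodings.
print "UTF-8:   ", hex_bytes(encode('UTF-8',   $yeru)), "\n"; # D0 AB
print "UCS-2BE: ", hex_bytes(encode('UCS-2BE', $yeru)), "\n"; # 04 2B

# A plain ASCII letter in UCS-2BE picks up a leading zero byte.
print "UCS-2BE 'A': ", hex_bytes(encode('UCS-2BE', 'A')), "\n"; # 00 41
```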
Re: UTF-8 and Unicode the hard way by Anonymous Monk on May 10, 2022 at 07:21 UTC
?node_id=551676#i_o_flow__the_actual_5_minute_tutorial_
Re^2: UTF-8 and Unicode the hard way by Anonymous Monk on May 10, 2022 at 17:33 UTC
Thank you! I will read it until I understand it. Shouldn't take more than a hundred times.
Re^3: UTF-8 and Unicode the hard way by kcott (Archbishop) on May 11, 2022 at 09:15 UTC
That's now part of the core documentation: "perlunitut - Perl Unicode Tutorial". The specific section linked is: "I/O flow (the actual 5 minute tutorial)". I had a very quick scan through the text. Although it looks to be much the same, there may well have been updates (improvements, corrections, etc.) since "perlunitut: Unicode in Perl" was written (16 years ago).

From "Re^4: UTF-8 and Unicode the hard way": "(OP here -- sorry, I should have obtained a username before starting this)". Yes, please do. It'll allow you to edit your posts and us to differentiate you from other AMs.

-- Ken