Re^2: Proper Unicode handling in Perl

This thread is getting long, and I have a couple screenfuls of comments, output, and source, which I'll put between readmore tags to save scrollfingers...

This is a pretty straightforward way to deal with Unicode and UTF-8.

With respect, I don't think the solution you give is gonna do the trick. Let me add that I'm no Windows guru, but I find myself unable to replicate your result on the platform OP has. The reason that I looked at this thread is that I'm breaking in my new Windows laptop with Strawberry Perl, and I wanted to see if I could do the basic things that OP seeks. In my experience, if it's working up Russian, there is a marsh of mojibake before results obtain.

Obviously, You need to save your source code UTF-8 encoded.

Is this a thing? To my understanding, it is the opinion of the software which opens the file as to what its encoding is. The file properties for the script I post here show no such option. This is output from a version of the script that shows the data in different formats. I'm gonna try pre tags here:


C:\Users\tblaz\Documents\evelyn>perl 2.cyr.pl
-------

  {
    name => "\x{411}\x{418}\x{411}\x{41B}\x{418}\x{41E}\x{422}\x{415}\x{41A}\x{410}\x{420}\x{42C}",
    recentactions => 38,
    userid => 1686692,
  },
  {
    name => "\x{411}\x{430}\x{431}\x{43A}\x{438}\x{43D}\x{44A} \x{41C}\x{438}\x{445}\x{430}\x{438}\x{43B}\x{44A}",
    recentactions => 144,
    userid => 2208294,
  },
  {
    name => "\x{411}\x{430}\x{434}\x{43C}\x{430} \x{425}\x{430}\x{440}\x{43B}\x{443}\x{435}\x{432}\x{430}",
    recentactions => 4,
    userid => 2587115,
  },

-------
$VAR1 = 
          {
            'recentactions' => 38,
            'userid' => 1686692,
            'name' => "\x{411}\x{418}\x{411}\x{41b}\x{418}\x{41e}\x{422}\x{415}\x{41a}\x{410}\x{420}\x{42c}"
          },
          {
            'name' => "\x{411}\x{430}\x{431}\x{43a}\x{438}\x{43d}\x{44a} \x{41c}\x{438}\x{445}\x{430}\x{438}\x{43b}\x{44a}",
            'recentactions' => 144,
            'userid' => 2208294
          },
          {
            'name' => "\x{411}\x{430}\x{434}\x{43c}\x{430} \x{425}\x{430}\x{440}\x{43b}\x{443}\x{435}\x{432}\x{430}",
            'recentactions' => 4,
            'userid' => 2587115
          }
        ;
-------
Content-Type: text/html; charset=utf-8

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>╨£╨╛╨╣ ╤é╨╡╤ü╤é</title>
</head>
<body>
╨æ╨░╨▒╨║╨╕╨╜╤è ╨£╨╕╤à╨░╨╕╨╗╤è
</body>
</html>

Source that produced this:

#!/usr/bin/perl -w
use 5.011;

# use utf8; commenting out
## first time for this
use utf8::all;
use Encode;

use LWP::UserAgent;
use HTTP::Request::Common;
use HTTP::Cookies;

use JSON;
use Data::Dump;
use Data::Dumper;
binmode STDOUT, ":utf8";

my $browser = LWP::UserAgent->new;

# they ask to use descriptive user-agent - not LWP defaults
# w:ru:User:Bot_of_the_Seven = https://ru.wikipedia.org/wiki/Участник:Bot_of_the_Seven
$browser->agent('w:ru:User:Bot_of_the_Seven (LWP like Gecko) We come in peace');

# I need cookies exchange enabled for auth
# here is doesn't matter but to give full LWP picture:
$browser->cookie_jar( {} );

# a very few queries can be done by GET - most of MediaWiki require POST
# so I do POST all around rather than remember where GET is allowed or not:
my $response = $browser->request(
  POST 'https://ru.wikipedia.org/w/api.php',
  {
    'format'        => 'json',
    'formatversion' => 2,
    'errorformat'   => 'bc',

    'action'        => 'query',
    'list'          => 'allusers',
    'auactiveusers' => 1,
    'aulimit'       => 10,
    'aufrom'        => 'Б'
  }
);

my $data = decode_json( $response->content );

my $test_scalar = $data->{query}->{allusers}[0]->{name};

my @test_array = @{ $data->{query}->{allusers} }[ 0 .. 2 ];

say "test array is @test_array";
say "-------";
dd \@test_array;
say "-------";
print Dumper \@test_array;
say "-------";

display_html( $test_array[1]->{name} );

sub display_html {
    use HTML::Entities;
    my $html_encoded = encode_entities(shift, '<>&"');
    my @html = (
        '<!DOCTYPE html>',
        '<html>',
        '<head>',
        '<meta charset="UTF-8">',
        '<title>Мой тест</title>',
        '</head>',
        '<body>',
        $html_encoded // 'Статус — ОК', # soft OR: 0 and empty string accepted
        '</body>',
        '</html>'
    );
    
    # to avoid "wide character" warnings:
    binmode STDOUT, ':utf8';
    
    print "Content-Type: text/html; charset=utf-8\n\n";
    
    print join("\n", @html);
    }
__END__

You must check whether the JSON data might, in some circumstances, contain characters which have a special meaning in HTML, in particular < and &.
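For what it's worth, here is a minimal sketch of what that check amounts to. It is a core-only stand-in, not the HTML::Entities code from the script: the helper name escape_html and the lookup table are made up for this illustration, and it only handles the four characters passed to encode_entities above. Cyrillic text passes through untouched, so escaping is not what garbles the output.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the four characters that are special in HTML.
my %ENTITY = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);

sub escape_html {
    my ($text) = @_;
    $text =~ s/([&<>"])/$ENTITY{$1}/g;    # everything else is left alone
    return $text;
}

# Markup characters are neutralised before they reach the browser.
print escape_html('Tom & "Jerry" <tj@example.org>'), "\n";
```

In the real script encode_entities does this job; the sketch is only there to show that the step works on characters, not on bytes.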

I have tried this script both with and without your changes to the html display, yet his test does not render. Meanwhile, I can read it fine in Notepad and Notepad++. Telling for me is when I asked for a listing on STDOUT. I'll try this abbreviated and with code tags:

Edit

Showing source listing from haj's subroutine

/Edit

C:\Users\tblaz\Documents\evelyn>type 2.cyr.pl
#!/usr/bin/perl -w
use 5.011;
...

sub display_html {
    use HTML::Entities;
    my $html_encoded = encode_entities(shift, '<>&"');
    my @html = (
        '<!DOCTYPE html>',
        '<html>',
        '<head>',
        '<meta charset="UTF-8">',
        '<title>╨£╨╛╨╣ ╤é╨╡╤ü╤é</title>',
        '</head>',
        '<body>',
        $html_encoded // '╨í╤é╨░╤é╤â╤ü ΓÇö ╨₧╨Ü', # soft OR: 0 and empty string accepted
        '</body>',
        '</html>'
    );

    # to avoid "wide character" warnings:
    binmode STDOUT, ':utf8';

    print "Content-Type: text/html; charset=utf-8\n\n";

    print join("\n", @html);
        }

What I see is that "My test" does not even render here. To my eye, he has all of the russian on the hook with his data queries; it's just not getting represented correctly on the terminal that Strawberry Perl gives you. His install might be as fresh out of the box as mine.

To illustrate what I think is going on, I created a smaller script:

C:\Users\tblaz\Documents\evelyn>perl 1.hello.cyr.pl
╨ƒ╤Ç╨╕╨▓╨╡╤é

Source listing:

#!/usr/bin/perl -w
use 5.016;
use utf8::all;
#binmode STDOUT, ":utf8";

say "Привет";
__END__

This one line might best be represented with a p tag:

say "Привет";

Anyways, it seems like there's some wonky IO layer going on here...

Thanks all for interesting comments,

Re^3: Proper Unicode handling in Perl by haj (Vicar) on Sep 05, 2019 at 02:23 UTC
Short answer: I can reproduce your strings '╨£╨╛╨╣ ╤é╨╡╤ü╤é' and '╨æ╨░╨▒╨║╨╕╨╜╤è ╨£╨╕╤à╨░╨╕╨╗╤è' from the correct ones 'Мой тест' and 'Бабкинъ Михаилъ'. It looks like you print strings which are intended for a UTF-8-enabled browser to a terminal which doesn't understand UTF-8. I guess you are using a Windows terminal with a CP437-compatible (cyrillic) codepage. Please retry your test after entering the command `chcp 65001`.

Long answer follows.

I wrote:

Obviously, You need to save your source code UTF-8 encoded.

You ask:

Is this a thing? To my understanding, it is the opinion of the software which opens the file as to what its encoding is.

Yes, it is a thing, which I keep explaining to people who seem to be unaware of the many places where encoding and decoding take place behind the scenes. If you see a cyrillic character on your editor screen, then you see a glyph which looks like, say, Б. The Unicode consortium has assigned the codepoint U+0411 and the name `CYRILLIC CAPITAL LETTER BE` to this character. When such a character is written to a file, then the editor doesn't paint the glyph, nor does it write the codepoint number. Instead, it converts it into a sequence of bytes according to some encoding. In UTF-8, a Б is represented by the (hexadecimal) sequence `D091`, in Windows Codepage 1251 it is represented by the sequence `C1`, and in Windows Codepage 866 it is represented by the sequence `81`. About 20 years ago, Roman Czyborra collected these and other encodings of the cyrillic alphabets under The Cyrillic Charset Soup.

Your editor has to choose one of the encodings. It does so according to some system or user preferences, but every editor is different, and some might not even provide decent information about their choice. If you use, for example, Emacs (available on Windows, too), then the buffer's default encoding is displayed on-screen, but you can override it when saving the file.
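To see those byte sequences for yourself, here is a minimal sketch using the core Encode module. The helper name hex_bytes is made up for this illustration, and the encoding names (cp1251, cp866, UTF-16BE) are the ones Encode ships with as far as I know.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the bytes of $char in $encoding, as uppercase hex.
sub hex_bytes {
    my ($encoding, $char) = @_;
    return uc unpack('H*', encode($encoding, $char));
}

my $be = "\x{411}";    # CYRILLIC CAPITAL LETTER BE, written as an escape
                       # so the sketch does not depend on the file's encoding

# One character, one codepoint, several different byte sequences on disk.
for my $encoding (qw(UTF-8 cp1251 cp866 UTF-16BE)) {
    printf "%-8s %s\n", $encoding, hex_bytes($encoding, $be);
}
```

The four lines printed should show `D091`, `C1`, `81` and `0411`, matching the values quoted in this post.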
Editors which claim to support Unicode ought to be able to save files under at least one of the UTF encodings. Maybe other monks have current data, but I recall times where Windows editors like Notepad and Notepad++ saved "Unicode" files under UTF-16-BE, which is not UTF-8 and represents a Б by `0411`. This looks like the codepoint number. This is no coincidence, but it led some software engineers to the wrong conclusion that this is "the Unicode encoding".

Now what happens if an editor opens an existing file? Where does it derive its opinion from? Well, in general, it can't. The byte `C1` could mean either a Б or, if the file was meant to be read as Windows Codepage 1252, an Á. A sequence `D091` renders as Ð‘ under Windows Codepage 1252, as Р‘ under Windows Codepage 1251, and as Б under UTF-8.

But again, there is a special case: In UTF-16 encodings, there are two possible ways to write `0411` to disk, depending on whether your hardware architecture is "little endian" or "big endian". To distinguish between these two, the standards use the special character U+FEFF 'ZERO WIDTH NO-BREAK SPACE'. A space which doesn't break words and has no width is pretty invisible, so it doesn't do any harm. Big endian systems write this as `FEFF`, while little endian systems swap the bytes and write `FFFE`. So, whenever a file starts with either `FEFF` or `FFFE`, the editor can with some confidence assume that the encoding is UTF-16-BE or UTF-16-LE, respectively. If that invisible space is the first character of a file, it is called a Byte Order Mark, BOM. For UTF-8, the BOM is optional and rarely used, some programs don't like it if it is there, and it has the byte sequence `EFBBBF`. There is no such thing as a BOM for any of the one-byte encodings. If a file does not start with a BOM, you have nothing.

Similar things happen when the Perl interpreter reads a file. Per default, Perl 5 expects ISO-8859-1 encoding for its source code, which has no BOM.
So, if your source code contains the byte `C1` in a literal, then Perl interprets it as the letter Á, and if it contains the bytes `D091` in a literal, then Perl interprets it as a Ð followed by a non-printable character, because `91` maps to a control character in ISO-8859-1. To allow human-readable Unicode characters in Perl 5 sources so that you can write Б instead of `"\x{411}"`, the pragma `use utf8;` was introduced. This, however, requires that the Б has been written to disk as the sequence `D091` by your editor. I've already said this: There is no pragma for UTF-16 or any other Unicode (or Cyrillic) encoding.

You wrote:

To my eye, he has all of the russian on the hook with his data queries; it's just not getting represented correctly on the terminal that Strawberry Perl gives you.

This is another wrong assumption. It isn't Strawberry Perl which gives you the terminal, it is the Windows operating system. And - truth hurts - if you spit out UTF-8 encoded strings to a Windows terminal, then it might or might not create the correct glyphs. The terminal is, like your editor, a piece of software which receives a bunch of bytes and tries to create the correct character glyphs for you, according to some encoding. The default encoding of the Windows terminal is not UTF-8, but instead some codepage defined in the regional settings of the operating system (I'm currently on a Linux box, so doing that research is up to you). You can learn what codepage is active with the `chcp` command, and you can also switch your Windows terminal to UTF-8 with the command `chcp 65001`.
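The BOM check an editor might do when opening a file can be sketched as follows. This is only an illustration: guess_encoding_from_bom is a made-up helper, it looks at just the three BOMs discussed here and ignores everything else (UTF-32, for one), and a file without a BOM still leaves you guessing. On disk, the bytes FE FF at the start mean big endian and FF FE mean little endian.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding from the first raw bytes of a file.
# Returns undef when there is no BOM, i.e. when "you have nothing".
sub guess_encoding_from_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

# Four ways a file holding just the letter U+0411 might start on disk.
for my $sample ("\xFE\xFF\x04\x11", "\xFF\xFE\x11\x04",
                "\xEF\xBB\xBF\xD0\x91", "\xD0\x91") {
    printf "%-10s %s\n", uc(unpack('H*', $sample)),
        guess_encoding_from_bom($sample) // 'no BOM, no clue';
}
```

For a real file you would read the first few bytes from a handle opened with the `:raw` layer and pass them to the helper.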
Re^4: Proper Unicode handling in Perl (aside on Notepad++) by pryrt (Abbot) on Sep 06, 2019 at 00:37 UTC
Maybe other monks have current data, but I recall times where Windows editors like Notepad and Notepad++ saved "Unicode" files under UTF-16-BE

While I'm pretty sure that's accurate for older versions of notepad.exe (it matches my memory; though as of current Win10 1903, even that shows choices for the encoding, including UTF-8 and UTF-16 BE/LE), that's not correct for Notepad++. In modern Notepad++ (v7.7.1), and as far back as I could try* (v4.0 from Jan 2007), Notepad++ has listed separate encodings for UTF-8, UCS-2 LE, and UCS-2 BE, not calling any of them the generic "Unicode" name.

(*: I tried most major versions backwards in time, and all agreed in the results. When I tried v3.0 from 2005, it wouldn't even run on my machine, so I didn't go any farther back than that.)
Re^5: Proper Unicode handling in Perl (aside on Notepad++) by VK (Novice) on Sep 06, 2019 at 16:50 UTC
At the initial standardization period there were a number of Unicode transport protocols, some of them really weird (7-bit UTF for one). As of now Notepad++ is pretty straightforward about what is needed. It has in its settings:

ANSI
UTF-8 (default, plus switch "Apply to open ANSI files")
UTF-8 with BOM
UCS-2 Big Endian with BOM
UCS-2 Little Endian with BOM
other (long-long list)

The rule of thumb is that default UTF-8 + "Apply to open ANSI files" is the only thing one ought to use. Anything else is only for two distinct situations: 1) one got some unreadable chunk of chars from a 3rd party file and needs to make it readable Unicode; 2) one is seeking new oops-type adventures for (her|him)self and for end users.
Re^6: Proper Unicode handling in Perl (aside on Notepad++) by haj (Vicar) on Sep 06, 2019 at 18:17 UTC
Re^4: Proper Unicode handling in Perl by Aldebaran (Curate) on Sep 10, 2019 at 06:30 UTC
Codepages seem to be the sticky wicket:

C:\Users\tblaz\Documents\evelyn>perl 1.hello.cyr.pl
default encoding:
Active code page: 437
╨ƒ╤Ç╨╕╨▓╨╡╤é
------------
Active code page: 65001
Привет

#!/usr/bin/perl -w
use 5.016;
use utf8::all;
#binmode STDOUT, ":utf8";

say "default encoding:";
system "chcp";
say "Привет";
say "------------";
system "chcp 65001";
say "Привет";
__END__

Once the right codepage is present, the test succeeds:

C:\Users\tblaz\Documents\evelyn>perl 1.cyr.pl
...
-------
Content-Type: text/html; charset=utf-8

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Мой тест</title>
</head>
<body>
Бабкинъ Михаилъ
</body>
</html>

C:\Users\tblaz\Documents\evelyn>

Now I wonder how to make that the default setting....
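The garbled line can be reproduced without a Windows console at all. A minimal sketch, assuming the console simply reads the UTF-8 bytes Perl sends as if they were code page 437 (the helper name as_seen_in_codepage is made up for this illustration):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: what a console using $codepage would show for $text
# when the program writes $text as UTF-8 bytes.
sub as_seen_in_codepage {
    my ($text, $codepage) = @_;
    my $bytes = encode('UTF-8', $text);    # what the :utf8 layer sends
    return decode($codepage, $bytes);      # how the console reads those bytes
}

my $hello = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}";    # "Привет"

binmode STDOUT, ':encoding(UTF-8)';
for my $codepage (qw(cp437 UTF-8)) {
    printf "%-6s %s\n", $codepage, as_seen_in_codepage($hello, $codepage);
}
```

Run on a UTF-8 terminal, the cp437 line should show the same box-drawing soup as the first half of the test above, and the UTF-8 line the intended word. The bytes are identical in both cases; only the reader's interpretation differs.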
Re^3: Proper Unicode handling in Perl by VK (Novice) on Sep 06, 2019 at 15:01 UTC
As another poster hinted, to get correct "8-bit Unicode Transformation Format" (commonly known as "UTF-8") results it is not sufficient to output it right; you also need to output to an environment that is able to interpret UTF-8 properly. In 2019 it takes considerable effort to find an environment that is not able to do it. Yet as usual Matthew rulez - "and whoever seeks, finds" (Mat. 7:8)

And yes, it is important not to screw things up by saving files wrong. Personally, I perl with Notepad++ with its option set to "Encoding -> UTF-8" plus the flag "apply to open ANSI files". That keeps me safe for my purposes. Because as of the year 2019 Unicode/UTF-8 is not yet another encoding to deal with. It is the encoding, the only one to deal with. Anything else - only for very special occasions when the soul is really asking for adventures. :-)

(update) With that, coming back to my original question. Just try to replace the rather convoluted POST calls with LWP's native ->get() and ->post() methods - and we are back to square one, with every response like Ð‘Ð˜Ð‘Ð›Ð˜ÐžÐ¢Ð•ÐšÐÐ Ð and crap. So LWP gets the right response headers (saying that it is UTF-8). The program says use utf8. Print says output utf8. Yet everything seems to stick to the logic "if there is the slightest possibility it is not that damn utf-8, then don't use it; if there is no such possibility, still don't use it". This is what I said in my OP: "in order to achieve it I had to make my script like a drunk buddy I have to get from the bar back home :-) - once I lower attention, he tries to fell on the ground and sleep". It may be funny at first but it gets highly irritating soon.
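Output of the Ð‘ kind is consistent with text being encoded twice: UTF-8 bytes that were never decoded get printed through a UTF-8 output layer, which encodes every single byte again. Whether that is what happens in the script above is a guess, since the post does not show which accessor was used; with LWP the usual suspects are `$response->content` (raw bytes) versus `$response->decoded_content` (characters). A minimal sketch of the effect, with a made-up helper name and a hard-coded two-byte "response":

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: the bytes a ':utf8' output layer would write for
# $string, shown as uppercase hex.
sub bytes_after_utf8_layer {
    my ($string) = @_;
    return uc unpack('H*', encode('UTF-8', $string));
}

my $from_wire = "\xD0\x91";    # UTF-8 bytes for U+0411, as an HTTP body carries them

# Raw bytes treated as characters: each byte is encoded a second time.
print "undecoded: ", bytes_after_utf8_layer($from_wire), "\n";

# Decoded once on input: the layer encodes real characters exactly once.
print "decoded:   ", bytes_after_utf8_layer(decode('UTF-8', $from_wire)), "\n";
```

The first line prints `C390C291` (the double-encoded form, which a browser shows as Ð plus a stray second character), the second prints `D091`. The rule behind it: decode exactly once where data enters the program, encode exactly once where it leaves.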