in reply to Re: Proper Unicode handling in Perl
in thread Is there some universal Unicode+UTF8 switch?
This thread is getting long, and I have a couple screenfuls of comments, output, and source, which I'll put between readmore tags to save scrollfingers...
With respect, I don't think the solution you give is gonna do the trick. Let me add that I'm no Windows guru, but I find myself unable to replicate your result on the platform OP has. The reason I looked at this thread is that I'm breaking in my new Windows laptop with Strawberry Perl, and I wanted to see if I could do the basic things that OP seeks. In my experience, when you're working with Russian, there is a marsh of mojibake to wade through before results obtain.
"Obviously, You need to save your source code UTF-8 encoded."
Is this a thing? To my understanding, it is up to the software that opens a file to decide what its encoding is. The properties dialog for the script I post here has no such option. This is output from a version of the script that shows the data in different formats. I'm gonna try pre tags here:
    C:\Users\tblaz\Documents\evelyn>perl 2.cyr.pl
    -------
    {
      name => "\x{411}\x{418}\x{411}\x{41B}\x{418}\x{41E}\x{422}\x{415}\x{41A}\x{410}\x{420}\x{42C}",
      recentactions => 38,
      userid => 1686692,
    },
    {
      name => "\x{411}\x{430}\x{431}\x{43A}\x{438}\x{43D}\x{44A} \x{41C}\x{438}\x{445}\x{430}\x{438}\x{43B}\x{44A}",
      recentactions => 144,
      userid => 2208294,
    },
    {
      name => "\x{411}\x{430}\x{434}\x{43C}\x{430} \x{425}\x{430}\x{440}\x{43B}\x{443}\x{435}\x{432}\x{430}",
      recentactions => 4,
      userid => 2587115,
    },
    -------
    $VAR1 = {
              'recentactions' => 38,
              'userid' => 1686692,
              'name' => "\x{411}\x{418}\x{411}\x{41b}\x{418}\x{41e}\x{422}\x{415}\x{41a}\x{410}\x{420}\x{42c}"
            },
            {
              'name' => "\x{411}\x{430}\x{431}\x{43a}\x{438}\x{43d}\x{44a} \x{41c}\x{438}\x{445}\x{430}\x{438}\x{43b}\x{44a}",
              'recentactions' => 144,
              'userid' => 2208294
            },
            {
              'name' => "\x{411}\x{430}\x{434}\x{43c}\x{430} \x{425}\x{430}\x{440}\x{43b}\x{443}\x{435}\x{432}\x{430}",
              'recentactions' => 4,
              'userid' => 2587115
            }
            ;
    -------
    Content-Type: text/html; charset=utf-8

    <!DOCTYPE html>
    <html>
    <head>
    <meta charset="UTF-8">
    <title>╨╨╛╨╣ ╤╨╡╤╤</title>
    </head>
    <body>
    ╨╨░╨▒╨║╨╕╨╜╤ ╨╨╕╤╨░╨╕╨╗╤
    </body>
    </html>
Source that produced this:
"You must check whether the JSON data might, in some circumstances, contain characters which have a special meaning in HTML, in particular < and &."

    #!/usr/bin/perl -w
    use 5.011;

    # use utf8; commenting out ## first time for this
    use utf8::all;
    use Encode;
    use LWP::UserAgent;
    use HTTP::Request::Common;
    use HTTP::Cookies;
    use JSON;
    use Data::Dump;
    use Data::Dumper;
    binmode STDOUT, ":utf8";

    my $browser = LWP::UserAgent->new;

    # they ask to use descriptive user-agent - not LWP defaults
    # w:ru:User:Bot_of_the_Seven = https://ru.wikipedia.org/wiki/Участник:Bot_of_the_Seven
    $browser->agent('w:ru:User:Bot_of_the_Seven (LWP like Gecko) We come in peace');

    # I need cookies exchange enabled for auth
    # here is doesn't matter but to give full LWP picture:
    $browser->cookie_jar( {} );

    # a very few queries can be done by GET - most of MediaWiki require POST
    # so I do POST all around rather then remember where GET is allowed or not:
    my $response = $browser->request(
        POST 'https://ru.wikipedia.org/w/api.php',
        {
            'format'        => 'json',
            'formatversion' => 2,
            'errorformat'   => 'bc',
            'action'        => 'query',
            'list'          => 'allusers',
            'auactiveusers' => 1,
            'aulimit'       => 10,
            'aufrom'        => 'Б'
        }
    );

    my $data = decode_json( $response->content );
    my $test_scalar = $data->{query}->{allusers}[0]->{name};
    my @test_array = @{ $data->{query}->{allusers} }[ 0 .. 2 ];
    say "test array is @test_array";
    say "-------";
    dd \@test_array;
    say "-------";
    print Dumper \@test_array;
    say "-------";
    display_html( $test_array[1]->{name} );

    sub display_html {
        use HTML::Entities;
        my $html_encoded = encode_entities(shift, '<>&"');
        my @html = (
            '<!DOCTYPE html>',
            '<html>',
            '<head>',
            '<meta charset="UTF-8">',
            '<title>Мой тест</title>',
            '</head>',
            '<body>',
            $html_encoded // 'Статус ОК', # soft OR: 0 and empty string accepted
            '</body>',
            '</html>'
        );
        # to avoid "wide character" warnings:
        binmode STDOUT, ':utf8';
        print "Content-Type: text/html; charset=utf-8\n\n";
        print join("\n", @html);
    }
    __END__
I have tried this script both with and without your changes to the html display, yet his test does not render. Meanwhile, I can read it fine in Notepad and Notepad++. Telling for me is when I asked for a listing on STDOUT. I'll try this abbreviated and with code tags:
Edit: Showing source listing from haj's subroutine

    C:\Users\tblaz\Documents\evelyn>type 2.cyr.pl
    #!/usr/bin/perl -w
    use 5.011;
    ...
    sub display_html {
        use HTML::Entities;
        my $html_encoded = encode_entities(shift, '<>&"');
        my @html = (
            '<!DOCTYPE html>',
            '<html>',
            '<head>',
            '<meta charset="UTF-8">',
            '<title>╨╨╛╨╣ ╤╨╡╤╤</title>',
            '</head>',
            '<body>',
            $html_encoded // '╨╤╨░╤╤ +;╤ Γ ╨₧╨', # soft OR: 0 and empty string accepted
            '</body>',
            '</html>'
        );
        # to avoid "wide character" warnings:
        binmode STDOUT, ':utf8';
        print "Content-Type: text/html; charset=utf-8\n\n";
        print join("\n", @html);
    }
What I see is that "My test" does not even render here. To my eye, he has all of the Russian on the hook with his data queries; it's just not getting represented correctly on the terminal that Strawberry Perl gives you. His install might be as fresh out of the box as mine.
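Whether the script file really is UTF-8 on disk can be checked without trusting an editor or the console: read the raw bytes and attempt a strict decode. What follows is only a minimal sketch, assuming the core Encode module; the helper name `is_valid_utf8_file` is made up for illustration and does not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): returns 1 if the raw bytes
# of the file at $path form well-formed UTF-8, 0 otherwise.
sub is_valid_utf8_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> } // '';
    close $fh;
    # FB_CROAK makes decode() die on malformed input instead of
    # silently substituting replacement characters.
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Report on every file named on the command line.
for my $path (@ARGV) {
    printf "%s: %s\n", $path,
        is_valid_utf8_file($path) ? 'valid UTF-8' : 'not valid UTF-8';
}
```

Saved under some name of your choosing, it would be run as `perl check_utf8.pl 2.cyr.pl` (the file name `check_utf8.pl` is just an example). Note the limits: a pure ASCII file also passes, and passing only shows the bytes are well-formed UTF-8, not what any given terminal will do with them.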
To illustrate what I think is going on, I created a smaller script:
    C:\Users\tblaz\Documents\evelyn>perl 1.hello.cyr.pl
    ╨╤╨╕╨▓╨╡╤
Source listing:
    #!/usr/bin/perl -w
    use 5.016;
    use utf8::all;
    #binmode STDOUT, ":utf8";
    say "Привет";
    __END__
This one line might best be represented with a p tag:
say "Привет";
Anyways, it seems like there's some wonky IO layer going on here...
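One possible reading of the "wonky IO layer" is that Perl is writing correct UTF-8 bytes and the console is interpreting each byte in its own OEM code page. The sketch below reproduces that effect in isolation. It assumes the console code page is 437, which the thread does not state (running `chcp` with no arguments would show the active one), and it uses only the core Encode module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                        # this source contains literal Cyrillic
use Encode qw(encode decode);

# What the script sends to STDOUT: the UTF-8 bytes of the string.
my $text  = "Привет";
my $bytes = encode('UTF-8', $text);

# What a console would display if it read those bytes as code page 437.
# ASSUMPTION: 437 is a guess at the console's code page, not a fact
# taken from the thread.
my $seen = decode('cp437', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
printf "bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $bytes;
print  "shown: $seen\n";
```

Every Cyrillic letter here is two bytes starting with D0 or D1, and those two bytes are ╨ and ╤ in code page 437, which is why the output above is dominated by that pair. If this is indeed the cause, switching the console to UTF-8 with `chcp 65001` before running the script may help, though whether the glyphs then render also depends on the console font.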
Thanks all for interesting comments,