PerlMonks  

UTF8 issue when getting website via LWP::UserAgent in Perl

by ultranerds (Hermit)
on May 12, 2016 at 13:02 UTC ( [id://1162858]=perlquestion )

ultranerds has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm trying to make a basic page that will grab a site (one with non-Latin characters, such as the Russian page below), then extract and print out the title. Here is my code:
    #!/usr/bin/perl
    use CGI::Carp qw(fatalsToBrowser);
    use strict;
    use lib './';
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request::Common qw(GET);

    my $ua = LWP::UserAgent->new;

    # Define user agent type
    $ua->agent('Mozilla/8.0');

    # Request object
    my $req = GET 'http://www.gazeta.ru/culture/2016/04/22/a_8191769.shtml';

    # Make the request
    my $res = $ua->request($req);

    binmode STDOUT, ":utf8";
    print "Content-Type: text/html; charset=utf-8 \n\n";

    if ($res->is_success) {
        my $title;
        $res->content =~ /<title>(.+?)<\/title>/ and $title = $1;
        print "GOT TITLE: $title \n";
    } else {
        print $res->status_line . "\n";
    }
For some reason, it just doesn't seem to want to play ball - I end up with:

-2016 - - .Ru

Instead of:

Румыния не будет участвовать в Евровидении-2016 из-за денег - Газета.Ru

Does anyone have any suggestions?

Thanks!

Andy

Replies are listed 'Best First'.
Re: UTF8 issue when getting website via LWP::UserAgent in Perl
by choroba (Cardinal) on May 12, 2016 at 13:14 UTC
    Interesting! They banned a singer from a TV contest because his country's TV company is heavily indebted?

    OK, besides trying to read the Cyrillic letters, I also examined the HTML tags. It seems the following part is important:

    <meta http-equiv="Content-Type" content="text/html; charset=windows-1251" />

    And yes, when decoding from the given encoding, it works:

        use Encode;
        # ...
        $title = decode('cp1251', $title);
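To see the decoding step in isolation, here is a minimal sketch that needs no network access. The sample title and the variable names are illustrative assumptions; only `decode('cp1251', ...)` comes from the reply above.

```perl
use strict;
use warnings;
use utf8;                        # string literals in this file are UTF-8
use Encode qw(decode encode);

binmode STDOUT, ':encoding(UTF-8)';

# Assumed sample data: simulate what the server sends, i.e. the title
# as raw windows-1251 bytes (one byte per Cyrillic letter).
my $expected = 'Газета.Ru';
my $raw      = encode('cp1251', $expected);

# Turn the bytes into Perl characters before printing or matching further.
my $title = decode('cp1251', $raw);

printf "raw bytes: %d, decoded characters: %d\n", length $raw, length $title;
print "GOT TITLE: $title\n";
```

Printing `$raw` directly through the UTF-8 output layer would treat each cp1251 byte as a Latin-1 character, which is one way to end up with a garbled title.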

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      If the site sends the proper HTTP headers, ->decoded_content should do that for you already. But often, the headers and the <meta> tag don't correspond to the actual content encoding...

      See HTTP::Message for discussion of ->decoded_content.
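A minimal sketch of the difference between `->content` and `->decoded_content`, assuming the HTTP::Message distribution (which provides HTTP::Response) is installed. The response is built by hand as a stand-in for the real page, so the body and header values here are assumptions rather than what gazeta.ru actually sends.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);
use HTTP::Response;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical stand-in for the fetched page: a windows-1251 body whose
# Content-Type header declares the matching charset.
my $body = encode('cp1251', '<html><head><title>Газета.Ru</title></head></html>');
my $res  = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=windows-1251' ],
    $body,
);

# ->content returns the raw bytes; ->decoded_content returns characters.
my ($raw_title)     = $res->content         =~ m{<title>(.+?)</title>}s;
my ($decoded_title) = $res->decoded_content =~ m{<title>(.+?)</title>}s;

print "decoded: $decoded_title\n";
```

In the original script, swapping `$res->content` for `$res->decoded_content` in the title match is the corresponding change, provided the declared charset is correct.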

        Hi,

        Ah, you star... ->decoded_content worked just fine.

        I agree that they don't always match up, but I'm not really sure how you can get around that if they have buggered up the response headers / meta tags for it?

        Cheers

        Andy
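One possible way to cope with wrong or missing charset declarations is to try strict decodes in a fixed order and keep the first that succeeds. This is only a heuristic sketch: `decode_with_fallback` is a hypothetical helper, not part of LWP or Encode, and the candidate order is an assumption. Single-byte encodings such as cp1251 accept almost any input, so they belong last.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

# Hypothetical helper: return the decoded text and the name of the first
# candidate encoding whose strict decode succeeds.
sub decode_with_fallback {
    my ($bytes, @candidates) = @_;
    for my $enc (@candidates) {
        my $copy  = $bytes;    # decode() with a CHECK value may modify its input
        my $chars = eval { decode($enc, $copy, Encode::FB_CROAK) };
        return ($chars, $enc) if defined $chars;
    }
    return;                    # nothing matched
}

# cp1251 bytes are not valid UTF-8, so the strict UTF-8 attempt fails
# and the helper falls through to cp1251.
my $bytes = encode('cp1251', 'Газета.Ru');
my ($text, $used) = decode_with_fallback($bytes, 'UTF-8', 'cp1251');

binmode STDOUT, ':encoding(UTF-8)';
print "decoded as $used: $text\n";
```

The charset from the header or the meta tag can be placed ahead of the defaults in the candidate list, so a correct declaration still wins.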
        Oh actually... maybe it didn't :/ It works the first time you print it out (directly), but then when saved to the DB it gets corrupted (even though the DB is in utf_bin)

        Any other ideas? Who would have thought this would be such a royal PITA! I wish we just all moved over to one charset ;) Here is what a dumper of the values looks like:
        $VAR1 = {
          'title_flag' => 0,
          'og_desc' => "\x{420}\x{443}\x{43c}\x{44b}\x{43d}\x{438}\x{44f} \x{43d}\x{435} \x{441}\x{43c}\x{43e}\x{436}\x{435}\x{442} \x{43f}\x{43e}\x{43a}\x{430}\x{437}\x{44b}\x{432}\x{430}\x{442}\x{44c} \x{442}\x{435}\x{43b}\x{435}\x{432}\x{438}\x{437}\x{438}\x{43e}\x{43d}\x{43d}\x{44b}\x{439} \x{43a}\x{43e}\x{43d}\x{43a}\x{443}\x{440}\x{441} \x{ab}\x{415}\x{432}\x{440}\x{43e}\x{432}\x{438}\x{434}\x{435}\x{43d}\x{438}\x{435}-2016\x{bb}, \x{43f}\x{435}\x{432}\x{435}\x{446} \x{41e}\x{432}\x{438}\x{434}\x{438}\x{443} \x{410}\x{43d}\x{442}\x{43e}\x{43d} \x{43d}\x{435} \x{432}\x{44b}\x{441}\x{442}\x{443}\x{43f}\x{438}\x{442} \x{432} \x{421}\x{442}\x{43e}\x{43a}\x{433}\x{43e}\x{43b}\x{44c}\x{43c}\x{435}, \x{430} \x{440}\x{443}\x{43c}\x{44b}\x{43d}\x{441}\x{43a}\x{438}\x{435} \x{442}\x{435}\x{43b}\x{435}\x{437}\x{440}\x{438}\x{442}\x{435}\x{43b}\x{438} \x{43d}\x{435} \x{441}\x{43c}\x{43e}\x{433}\x{443}\x{442} \x{43f}\x{440}\x{43e}\x{433}\x{43e}\x{43b}\x{43e}\x{441}\x{43e}\x{432}\x{430}\x{442}\x{44c} \x{437}\x{430} \x{43f}\x{43e}\x{43d}\x{440}\x{430}\x{432}\x{438}\x{432}\x{448}\x{438}\x{445}\x{441}\x{44f} \x{43c}\x{443}\x{437}\x{44b}\x{43a}\x{430}\x{43d}\x{442}\x{43e}\x{432} \x{2014} \x{438}\x{437}-\x{437}\x{430} \x{434}\x{43e}\x{43b}\x{433}\x{430} \x{432} 16 \x{43c}\x{43b}\x{43d} \x{448}\x{432}\x{435}\x{439}\x{446}\x{430}\x{440}\x{441}\x{43a}\x{438}\x{445} \x{444}\x{440}\x{430}\x{43d}\x{43a}\x{43e}\x{432}.",
          'title' => " \x{420}\x{443}\x{43c}\x{44b}\x{43d}\x{438}\x{44f} \x{43d}\x{435} \x{431}\x{443}\x{434}\x{435}\x{442} \x{443}\x{447}\x{430}\x{441}\x{442}\x{432}\x{43e}\x{432}\x{430}\x{442}\x{44c} \x{432} \x{ab}\x{415}\x{432}\x{440}\x{43e}\x{432}\x{438}\x{434}\x{435}\x{43d}\x{438}\x{438}-2016\x{bb} \x{438}\x{437}-\x{437}\x{430} \x{434}\x{435}\x{43d}\x{435}\x{433} - \x{413}\x{430}\x{437}\x{435}\x{442}\x{430}.Ru ",
          'charset' => 'windows-1251',
          'og_image' => ' http://img.gazeta.ru/files3/123/8192123/rumin-pic905-895x505-99564.jpg',
          'description' => "\x{420}\x{443}\x{43c}\x{44b}\x{43d}\x{438}\x{44f} \x{43d}\x{435} \x{441}\x{43c}\x{43e}\x{436}\x{435}\x{442} \x{43f}\x{43e}\x{43a}\x{430}\x{437}\x{44b}\x{432}\x{430}\x{442}\x{44c} \x{442}\x{435}\x{43b}\x{435}\x{432}\x{438}\x{437}\x{438}\x{43e}\x{43d}\x{43d}\x{44b}\x{439} \x{43a}\x{43e}\x{43d}\x{43a}\x{443}\x{440}\x{441} \x{ab}\x{415}\x{432}\x{440}\x{43e}\x{432}\x{438}\x{434}\x{435}\x{43d}\x{438}\x{435}-2016\x{bb}, \x{43f}\x{435}\x{432}\x{435}\x{446} \x{41e}\x{432}\x{438}\x{434}\x{438}\x{443} \x{410}\x{43d}\x{442}\x{43e}\x{43d} \x{43d}\x{435} \x{432}\x{44b}\x{441}\x{442}\x{443}\x{43f}\x{438}\x{442} \x{432} \x{421}\x{442}\x{43e}\x{43a}\x{433}\x{43e}\x{43b}\x{44c}\x{43c}\x{435}, \x{430} \x{440}\x{443}\x{43c}\x{44b}\x{43d}\x{441}\x{43a}\x{438}\x{435} \x{442}\x{435}\x{43b}\x{435}\x{437}\x{440}\x{438}\x{442}\x{435}\x{43b}\x{438} \x{43d}\x{435} \x{441}\x{43c}\x{43e}\x{433}\x{443}\x{442} \x{43f}\x{440}\x{43e}\x{433}\x{43e}\x{43b}\x{43e}\x{441}\x{43e}\x{432}\x{430}\x{442}\x{44c} \x{437}\x{430} \x{43f}\x{43e}\x{43d}\x{440}\x{430}\x{432}\x{438}\x{432}\x{448}\x{438}\x{445}\x{441}\x{44f} \x{43c}\x{443}\x{437}\x{44b}\x{43a}\x{430}\x{43d}\x{442}\x{43e}\x{432} \x{2014} \x{438}\x{437}-\x{437}\x{430} \x{434}\x{43e}\x{43b}\x{433}\x{430} \x{432} 16 \x{43c}\x{43b}\x{43d} \x{448}\x{432}\x{435}\x{439}\x{446}\x{430}\x{440}\x{441}\x{43a}\x{438}\x{445} \x{444}\x{440}\x{430}\x{43d}\x{43a}\x{43e}\x{432}."
        };
        ...and here is the outputted JSON:

        {"page_title":" Румыния не будет участвовать в Евровидении-2016 из-за денег - Газета.Ru ","description":"Румыния не сможет показывать телевизионный конкурс Евровидение-2016, певец Овидиу Антон не выступит в Стокгольме, а румынские телезрители не смогут проголосовать за понравившихся музыкантов из-за долга в 16 млн швейцарских франков."}

        ..yet here is what comes back out:
        $VAR1 = {
          'images' => '',
          'all_images' => '{"image_loop":["http://static.gazeta.ru/nm2012/i/quotes/finam_head.png","/nm2015//gzt/img/logo.png"," http://img.gazeta.ru/files3/123/8192123/rumin-pic905-895x505-99564.jpg"," http://img.gazeta.ru/files3/885/8195885/igra-pic265-265x150-77294.jpg"," http://img.gazeta.ru/files3/725/7953725/RIAN_02710972.HR.ru-pic410-410x230-99945.jpg"," http://img.gazeta.ru/files3/331/8116331/2016-02-22T104304Z_1519170817_D1AESOMBZKAD_RTRMADP_3_UKRAINE-TATARS-EUROVISION-pic410-410x230-5670.jpg","http://static.smi2.net/srcimg/2780020.png","/nm2015/gzt/img/logo_footer.png","http://static.gazeta.ru/nm2012/i/reuters_a2.png","http://static.gazeta.ru/nm2012/i/prime_a2.png","http://static.gazeta.ru/nm2012/i/interfax_a2.png","http://static.gazeta.ru/nm2012/i/ria_a2.png","http://static.gazeta.ru/nm2012/i/it_a3.png","http://static.gazeta.ru/nm2012/i/lj_a2.png"," http://img.gazeta.ru/files3/123/8192123/rumin-pic905-895x505-99564.jpg"]}',
          'url' => 'www.gazeta.ru/culture/2016/04/22/a_8191769.shtml',
          'title' => ' &#131;м&#139;ния не б&#131;де&#130; &#131;&#135;ас&#130;вова&#130;&#140; в «&#149;в&#128;овидении-2016» из-за денег - &#147;азе&#130;а.Ru ',
          'description' => ' &#131;м&#139;ния не сможе&#130; показ&#139;ва&#130;&#140; &#130;елевизионн&#139;й конк&#131;&#128;с «&#149;в&#128;овидение-2016», певе&#134; &#158;види&#131; Ан&#130;он не в&#139;с&#130;&#131;пи&#130; в С&#130;окгол&#140;ме, а &#128;&#131;м&#139;нские &#130;елез&#128;и&#130;ели не смог&#131;&#130; п&#128;оголосова&#130;&#140; за пон&#128;авив&#136;и&#133;ся м&#131;з&#139;кан&#130;ов &#128;&#148; из-за долга в 16 млн &#136;вей&#134;а&#128;ски&#133; &#132;&#128;анков.',
          'grab_id' => '133'
        };
        {"page_title":" &#131;м&#139;ния не б&#131;де&#130; &#131;&#135;ас&#130;вова&#130;&#140; в «&#149;в&#128;овидении-2016» из-за денег - &#147;азе&#130;а.Ru ","description":" &#131;м&#139;ния не сможе&#130; показ&#139;ва&#130;&#140; &#130;елевизионн&#139;й конк&#131;&#128;с «&#149;в&#128;овидение-2016», певе&#134; &#158;види&#131; Ан&#130;он не в&#139;с&#130;&#131;пи&#130; в С&#130;окгол&#140;ме, а &#128;&#131;м&#139;нские &#130;елез&#128;и&#130;ели не смог&#131;&#130; п&#128;оголосова&#130;&#140; за пон&#128;авив&#136;и&#133;ся м&#131;з&#139;кан&#130;ов &#128;&#148; из-за долга в 16 млн &#136;вей&#134;а&#128;ски&#133; &#132;&#128;анков.","cached":1}


        Cheers

        Andy
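The thread does not show the database code, so the cause cannot be confirmed from here. One common way for decoded text to come back corrupted is double encoding: UTF-8 bytes are mistaken for Latin-1 characters somewhere between the script and the table, then encoded again. The sketch below reproduces that effect with core modules only; the sample title is an assumption, and whether this matches the setup above is a guess.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);
use JSON::PP;

my $title = 'Газета.Ru';                       # decoded character string

# Correct: encode to UTF-8 bytes exactly once before handing to a byte sink.
my $once = encode('UTF-8', $title);

# Bug: the UTF-8 bytes are treated as Latin-1 characters and encoded again.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

printf "characters: %d, encoded once: %d bytes, encoded twice: %d bytes\n",
    length $title, length $once, length $twice;

# Reversing the extra layer recovers the text, which confirms double encoding.
my $repaired = decode('UTF-8', encode('ISO-8859-1', decode('UTF-8', $twice)));

# JSON::PP with ->utf8 already returns UTF-8 bytes, so the result must not
# pass through another encoding layer (for example a :utf8 filehandle).
my $json_bytes = JSON::PP->new->utf8->encode({ page_title => $title });
my $back       = JSON::PP->new->utf8->decode($json_bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "repaired: $repaired\n";
print "from JSON: $back->{page_title}\n";
```

If the stored value shrinks back to the original after one such reversal, the fix is usually to find the step that encodes a second time, or to make sure the database handle is configured to exchange UTF-8 with the script.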
