PerlMonks  

Re^2: UTF8 issue when getting website via LWP::UserAgent in Perl

by Corion (Patriarch)
on May 12, 2016 at 13:21 UTC


in reply to Re: UTF8 issue when getting website via LWP::UserAgent in Perl
in thread UTF8 issue when getting website via LWP::UserAgent in Perl

If the site sends the proper HTTP headers, ->decoded_content should do that for you already. But often, the headers and the <meta> tag don't correspond to the actual content encoding...

See HTTP::Message for discussion of ->decoded_content.
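For the case where the declared charset and the real content disagree, one workaround is to decode the raw bytes yourself. The following is only a minimal sketch using the core Encode module: decode_body() is a hypothetical helper (it is not part of LWP or HTTP::Message) that tries strict UTF-8 first and falls back to the declared charset. It is a heuristic, since text in another encoding can occasionally also be well-formed UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of LWP or HTTP::Message): decode raw bytes,
# preferring strict UTF-8 and falling back to the charset the site declared.
sub decode_body {
    my ($bytes, $declared_charset) = @_;

    # FB_CROAK makes decode() die on malformed input instead of silently
    # substituting U+FFFD. decode() may modify its input when a CHECK value
    # is given, so work on a copy.
    my $copy = $bytes;
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text if defined $text;

    # Not well-formed UTF-8: trust the declared charset instead.
    return decode($declared_charset, $bytes);
}

# The same word ("Romania" in Russian) as a windows-1251 body and as a UTF-8 body.
my $word       = "\x{420}\x{443}\x{43c}\x{44b}\x{43d}\x{438}\x{44f}";
my $cp1251     = encode('windows-1251', $word);
my $utf8_bytes = encode('UTF-8', $word);

my $from_cp1251 = decode_body($cp1251,     'windows-1251');
my $from_utf8   = decode_body($utf8_bytes, 'windows-1251');
```

Both calls should give back the same seven-character string, even though the second body is really UTF-8 while being labelled windows-1251.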

Replies are listed 'Best First'.
Re^3: UTF8 issue when getting website via LWP::UserAgent in Perl
by ultranerds (Hermit) on May 12, 2016 at 14:19 UTC
    Hi,

    Ah, you star... ->decoded_content worked just fine.

    I agree that they don't always match up - but I'm not really sure how you can get around that if they have buggered up the correct response headers / meta tags for it?

    Cheers

    Andy
Re^3: UTF8 issue when getting website via LWP::UserAgent in Perl
by ultranerds (Hermit) on May 12, 2016 at 14:40 UTC
    Oh actually... maybe it didn't :/ It works the first time you print it out (directly) ... but then when saved to the DB, it gets corrupted (even though the DB is in utf8_bin)

    Any other ideas? Who would have thought this would be such a royal PITA! I wish we just all moved over to one charset ;) Here is what a dumper of the values looks like:
    $VAR1 = {
      'title_flag' => 0,
      'og_desc' => "\x{420}\x{443}\x{43c}\x{44b}\x{43d}\x{438}\x{44f} \x{43d}\x{435} \x{441}\x{43c}\x{43e}\x{436}\x{435}\x{442} \x{43f}\x{43e}\x{43a}\x{430}\x{437}\x{44b}\x{432}\x{430}\x{442}\x{44c} \x{442}\x{435}\x{43b}\x{435}\x{432}\x{438}\x{437}\x{438}\x{43e}\x{43d}\x{43d}\x{44b}\x{439} \x{43a}\x{43e}\x{43d}\x{43a}\x{443}\x{440}\x{441} \x{ab}\x{415}\x{432}\x{440}\x{43e}\x{432}\x{438}\x{434}\x{435}\x{43d}\x{438}\x{435}-2016\x{bb}, \x{43f}\x{435}\x{432}\x{435}\x{446} \x{41e}\x{432}\x{438}\x{434}\x{438}\x{443} \x{410}\x{43d}\x{442}\x{43e}\x{43d} \x{43d}\x{435} \x{432}\x{44b}\x{441}\x{442}\x{443}\x{43f}\x{438}\x{442} \x{432} \x{421}\x{442}\x{43e}\x{43a}\x{433}\x{43e}\x{43b}\x{44c}\x{43c}\x{435}, \x{430} \x{440}\x{443}\x{43c}\x{44b}\x{43d}\x{441}\x{43a}\x{438}\x{435} \x{442}\x{435}\x{43b}\x{435}\x{437}\x{440}\x{438}\x{442}\x{435}\x{43b}\x{438} \x{43d}\x{435} \x{441}\x{43c}\x{43e}\x{433}\x{443}\x{442} \x{43f}\x{440}\x{43e}\x{433}\x{43e}\x{43b}\x{43e}\x{441}\x{43e}\x{432}\x{430}\x{442}\x{44c} \x{437}\x{430} \x{43f}\x{43e}\x{43d}\x{440}\x{430}\x{432}\x{438}\x{432}\x{448}\x{438}\x{445}\x{441}\x{44f} \x{43c}\x{443}\x{437}\x{44b}\x{43a}\x{430}\x{43d}\x{442}\x{43e}\x{432} \x{2014} \x{438}\x{437}-\x{437}\x{430} \x{434}\x{43e}\x{43b}\x{433}\x{430} \x{432} 16 \x{43c}\x{43b}\x{43d} \x{448}\x{432}\x{435}\x{439}\x{446}\x{430}\x{440}\x{441}\x{43a}\x{438}\x{445} \x{444}\x{440}\x{430}\x{43d}\x{43a}\x{43e}\x{432}.",
      'title' => " \x{420}\x{443}\x{43c}\x{44b}\x{43d}\x{438}\x{44f} \x{43d}\x{435} \x{431}\x{443}\x{434}\x{435}\x{442} \x{443}\x{447}\x{430}\x{441}\x{442}\x{432}\x{43e}\x{432}\x{430}\x{442}\x{44c} \x{432} \x{ab}\x{415}\x{432}\x{440}\x{43e}\x{432}\x{438}\x{434}\x{435}\x{43d}\x{438}\x{438}-2016\x{bb} \x{438}\x{437}-\x{437}\x{430} \x{434}\x{435}\x{43d}\x{435}\x{433} - \x{413}\x{430}\x{437}\x{435}\x{442}\x{430}.Ru ",
      'charset' => 'windows-1251',
      'og_image' => ' http://img.gazeta.ru/files3/123/8192123/rumin-pic905-895x505-99564.jpg',
      'description' => "\x{420}\x{443}\x{43c}\x{44b}\x{43d}\x{438}\x{44f} \x{43d}\x{435} \x{441}\x{43c}\x{43e}\x{436}\x{435}\x{442} \x{43f}\x{43e}\x{43a}\x{430}\x{437}\x{44b}\x{432}\x{430}\x{442}\x{44c} \x{442}\x{435}\x{43b}\x{435}\x{432}\x{438}\x{437}\x{438}\x{43e}\x{43d}\x{43d}\x{44b}\x{439} \x{43a}\x{43e}\x{43d}\x{43a}\x{443}\x{440}\x{441} \x{ab}\x{415}\x{432}\x{440}\x{43e}\x{432}\x{438}\x{434}\x{435}\x{43d}\x{438}\x{435}-2016\x{bb}, \x{43f}\x{435}\x{432}\x{435}\x{446} \x{41e}\x{432}\x{438}\x{434}\x{438}\x{443} \x{410}\x{43d}\x{442}\x{43e}\x{43d} \x{43d}\x{435} \x{432}\x{44b}\x{441}\x{442}\x{443}\x{43f}\x{438}\x{442} \x{432} \x{421}\x{442}\x{43e}\x{43a}\x{433}\x{43e}\x{43b}\x{44c}\x{43c}\x{435}, \x{430} \x{440}\x{443}\x{43c}\x{44b}\x{43d}\x{441}\x{43a}\x{438}\x{435} \x{442}\x{435}\x{43b}\x{435}\x{437}\x{440}\x{438}\x{442}\x{435}\x{43b}\x{438} \x{43d}\x{435} \x{441}\x{43c}\x{43e}\x{433}\x{443}\x{442} \x{43f}\x{440}\x{43e}\x{433}\x{43e}\x{43b}\x{43e}\x{441}\x{43e}\x{432}\x{430}\x{442}\x{44c} \x{437}\x{430} \x{43f}\x{43e}\x{43d}\x{440}\x{430}\x{432}\x{438}\x{432}\x{448}\x{438}\x{445}\x{441}\x{44f} \x{43c}\x{443}\x{437}\x{44b}\x{43a}\x{430}\x{43d}\x{442}\x{43e}\x{432} \x{2014} \x{438}\x{437}-\x{437}\x{430} \x{434}\x{43e}\x{43b}\x{433}\x{430} \x{432} 16 \x{43c}\x{43b}\x{43d} \x{448}\x{432}\x{435}\x{439}\x{446}\x{430}\x{440}\x{441}\x{43a}\x{438}\x{445} \x{444}\x{440}\x{430}\x{43d}\x{43a}\x{43e}\x{432}."
    };
    ...and here is the outputted JSON:

    {"page_title":" Румыния не будет участвовать в «Евровидении-2016» из-за денег - Газета.Ru ","description":"Румыния не сможет показывать телевизионный конкурс «Евровидение-2016», певец Овидиу Антон не выступит в Стокгольме, а румынские телезрители не смогут проголосовать за понравившихся музыкантов — из-за долга в 16 млн швейцарских франков."}

    ..yet here is what comes back out:
    $VAR1 = { 'images' => '', 'all_images' => '{"image_loop":["http://static.gazeta.ru/nm2 +012/i/quotes/finam_head.png","/nm2015//gzt/img/logo.png"," http://img +.gazeta.ru/files3/123/8192123/rumin-pic905-895x505-99564.jpg"," http: +//img.gazeta.ru/files3/885/8195885/igra-pic265-265x150-77294.jpg"," h +ttp://img.gazeta.ru/files3/725/7953725/RIAN_02710972.HR.ru-pic410-410 +x230-99945.jpg"," http://img.gazeta.ru/files3/331/8116331/2016-02-22T +104304Z_1519170817_D1AESOMBZKAD_RTRMADP_3_UKRAINE-TATARS-EUROVISION-p +ic410-410x230-5670.jpg","http://static.smi2.net/srcimg/2780020.png"," +/nm2015/gzt/img/logo_footer.png","http://static.gazeta.ru/nm2012/i/re +uters_a2.png","http://static.gazeta.ru/nm2012/i/prime_a2.png","http:/ +/static.gazeta.ru/nm2012/i/interfax_a2.png","http://static.gazeta.ru/ +nm2012/i/ria_a2.png","http://static.gazeta.ru/nm2012/i/it_a3.png","ht +tp://static.gazeta.ru/nm2012/i/lj_a2.png"," http://img.gazeta.ru/file +s3/123/8192123/rumin-pic905-895x505-99564.jpg"]}', 'url' => 'www.gazeta.ru/culture/2016/04/22/a_8191769.shtml', 'title' => ' Π Ρ&#131;ΠΌΡ&#139;ния Π½Π΅ Π±Ρ&#131;Π΄Π΅Ρ&#1 +30; Ρ&#131;Ρ&#135;асΡ&#130;Π²ΠΎΠ²Π°Ρ&#130;Ρ&#140; Π² Β«Π&#149;Π²Ρ&# +128;ΠΎΠ²ΠΈΠ΄Π΅Π½ΠΈΠΈ-2016Β» ΠΈΠ·-Π·Π° Π΄Π΅Π½Π΅Π³ - Π&#147;Π°Π·Π΅Ρ&#13 +0;Π°.Ru ', 'description' => 'Π Ρ&#131;ΠΌΡ&#139;ния Π½Π΅ смоТСΡ& +#130; ΠΏΠΎΠΊΠ°Π·Ρ&#139;Π²Π°Ρ&#130;Ρ&#140; Ρ&#130;Π΅Π»Π΅Π²ΠΈΠ·ΠΈΠΎΠ½Π½ +Ρ&#139;ΠΉ ΠΊΠΎΠ½ΠΊΡ&#131;Ρ&#128;с Β«Π&#149;Π²Ρ&#128;ΠΎΠ²ΠΈΠ΄Π΅Π½ΠΈΠ΅ +-2016Β», ΠΏΠ΅Π²Π΅Ρ&#134; Π&#158;Π²ΠΈΠ΄ΠΈΡ&#131; АнΡ&#130;ΠΎΠ½ Π½Π΅ +Π²Ρ&#139;сΡ&#130;Ρ&#131;ΠΏΠΈΡ&#130; Π² Π‘Ρ&#130;ΠΎΠΊΠ³ΠΎΠ»Ρ&#140;ΠΌΠ +΅, Π° Ρ&#128;Ρ&#131;ΠΌΡ&#139;нскиС Ρ&#130;Π΅Π»Π΅Π·Ρ&#128;ΠΈΡ&#13 +0;Π΅Π»ΠΈ Π½Π΅ смогΡ&#131;Ρ&#130; ΠΏΡ&#128;оголосоваΡ&#13 +0;Ρ&#140; Π·Π° ΠΏΠΎΠ½Ρ&#128;Π°Π²ΠΈΠ²Ρ&#136;ΠΈΡ&#133;ся ΠΌΡ&#131;Π·Ρ +&#139;ΠΊΠ°Π½Ρ&#130;ΠΎΠ² β&#128;&#148; ΠΈΠ·-Π·Π° Π΄ΠΎΠ»Π³Π° Π² 16 ΠΌΠ» +Π½ Ρ&#136;Π²Π΅ΠΉΡ&#134;Π°Ρ&#128;скиΡ&#133; Ρ&#132;Ρ&#128;Π°Π½ΠΊΠΎΠ +².', 'grab_id' => '133' }; {"page_title":" Π Ρ&#131;ΠΌΡ&#139;ния 
Π½Π΅ Π±Ρ&#131;Π΄Π΅Ρ&#130; Ρ&# +131;Ρ&#135;асΡ&#130;Π²ΠΎΠ²Π°Ρ&#130;Ρ&#140; Π² Β«Π&#149;Π²Ρ&#128;ΠΎΠ +²ΠΈΠ΄Π΅Π½ΠΈΠΈ-2016Β» ΠΈΠ·-Π·Π° Π΄Π΅Π½Π΅Π³ - Π&#147;Π°Π·Π΅Ρ&#130;Π°.Ru + ","description":"Π Ρ&#131;ΠΌΡ&#139;ния Π½Π΅ смоТСΡ&#130; ΠΏΠ +ΎΠΊΠ°Π·Ρ&#139;Π²Π°Ρ&#130;Ρ&#140; Ρ&#130;Π΅Π»Π΅Π²ΠΈΠ·ΠΈΠΎΠ½Π½Ρ&#139;ΠΉ + ΠΊΠΎΠ½ΠΊΡ&#131;Ρ&#128;с Β«Π&#149;Π²Ρ&#128;ΠΎΠ²ΠΈΠ΄Π΅Π½ΠΈΠ΅-2016Β», +ΠΏΠ΅Π²Π΅Ρ&#134; Π&#158;Π²ΠΈΠ΄ΠΈΡ&#131; АнΡ&#130;ΠΎΠ½ Π½Π΅ Π²Ρ&#139; +сΡ&#130;Ρ&#131;ΠΏΠΈΡ&#130; Π² Π‘Ρ&#130;ΠΎΠΊΠ³ΠΎΠ»Ρ&#140;ΠΌΠ΅, Π° Ρ&# +128;Ρ&#131;ΠΌΡ&#139;нскиС Ρ&#130;Π΅Π»Π΅Π·Ρ&#128;ΠΈΡ&#130;Π΅Π»ΠΈ +Π½Π΅ смогΡ&#131;Ρ&#130; ΠΏΡ&#128;оголосоваΡ&#130;Ρ&#140; + Π·Π° ΠΏΠΎΠ½Ρ&#128;Π°Π²ΠΈΠ²Ρ&#136;ΠΈΡ&#133;ся ΠΌΡ&#131;Π·Ρ&#139;ΠΊΠ +°Π½Ρ&#130;ΠΎΠ² β&#128;&#148; ΠΈΠ·-Π·Π° Π΄ΠΎΠ»Π³Π° Π² 16 ΠΌΠ»Π½ Ρ&#136 +;Π²Π΅ΠΉΡ&#134;Π°Ρ&#128;скиΡ&#133; Ρ&#132;Ρ&#128;Π°Π½ΠΊΠΎΠ².","cach +ed":1}


    Cheers

    Andy

      Your DB handle must understand what it's getting (or else you'll have to do your own encode/decode in and out of it) and your DB column charset must be set to what you think it is. Search utf-8 in the Pod for whatever DBI DBD driver you are using and make sure the column definition in your table is what you expect/need.
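As a rough illustration of "doing your own encode/decode in and out", here is a minimal sketch using only the core Encode module. The %fake_table hash stands in for a database layer that stores and returns raw bytes unchanged; store_title() and fetch_title() are hypothetical names, not part of any DB module. If the driver happens to be DBD::mysql, its mysql_enable_utf8 connect attribute is meant to make the handle do this for you, but whether the wrapper used in this thread exposes it is an assumption you would need to check.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Stand-in for a database layer that stores and returns raw bytes unchanged.
# %fake_table, store_title() and fetch_title() are hypothetical names.
my %fake_table;

sub store_title {
    my ($url, $title) = @_;
    # Characters -> UTF-8 bytes on the way in.
    $fake_table{$url} = encode('UTF-8', $title);
    return;
}

sub fetch_title {
    my ($url) = @_;
    my $bytes = $fake_table{$url};
    return unless defined $bytes;
    # UTF-8 bytes -> characters on the way out.
    return decode('UTF-8', $bytes);
}

# The site name from the page title, as a character string.
my $title = "\x{413}\x{430}\x{437}\x{435}\x{442}\x{430}.Ru";
store_title('Foo', $title);
my $round_trip = fetch_title('Foo');
```

The point of the design is that the rest of the program only ever sees character strings; bytes exist only at the boundary with the store.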

        Hi,

        Mmm.. it's almost like it's being double-encoded... Here is an example:

        print "GOT TITLE BEFORE: $title \n<Br><Br>";
        $title = encode("utf-8", $title);
        print "GOT TITLE: $title \n<Br><Br>";
        $DB->table("ReadingGrabCache")->add( { title => $title, url => "Foo" });
        my $test = $DB->table("ReadingGrabCache")->select ( { url => "Foo" })->fetchrow_hashref;
        print "BLA 1: $test->{title} \n<br>";


        ...which gives me:

        GOT TITLE BEFORE: Румыния не будет участвовать в «Евровидении-2016» из-за денег - Газета.Ru

        GOT TITLE: Румыния Π½Π΅ Π±ΡƒΠ΄Π΅Ρ‚ ΡƒΡ‡Π°ΡΡ‚Π²ΠΎΠ²Π°Ρ‚ΡŒ Π² Β«Π•Π²Ρ€ΠΎΠ²ΠΈΠ΄Π΅Π½ΠΈΠΈ-2016Β» ΠΈΠ·-Π·Π° Π΄Π΅Π½Π΅Π³ - Π“Π°Π·Π΅Ρ‚Π°.Ru

        BLA 1: Румыния Π½Π΅ Π±ΡƒΠ΄Π΅Ρ‚ ΡƒΡ‡Π°ΡΡ‚Π²ΠΎΠ²Π°Ρ‚ΡŒ Π² Β«Π•Π²Ρ€ΠΎΠ²ΠΈΠ΄Π΅Π½ΠΈΠΈ-2016Β» ΠΈΠ·-Π·Π° Π΄Π΅Π½Π΅Π³ - Π“Π°Π·Π΅Ρ‚Π°.Ru

        After grabbing it back from the DB, if I do this:

        $test->{title} = decode("utf-8", $test->{title});

        ...then it works! But it seems bonkers to have to do that, when it should be in utf8 already. Is there a way to "check" if a string has the correct utf8 markers?

        Thanks!

        Andy
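On the question of "checking" a string: a minimal sketch with core modules only. utf8::is_utf8() reports Perl's internal flag, which is not the same thing as "this is correct text", so the sketch also includes looks_like_utf8_bytes(), a hypothetical helper that tests whether a string still consists of well-formed UTF-8 bytes (that is, has not been decoded yet).

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "\x{420}\x{443}\x{43c}\x{44b}\x{43d}\x{438}\x{44f}";  # 7 characters
my $bytes = encode('UTF-8', $chars);                               # 14 bytes

# utf8::is_utf8() reports Perl's internal flag, nothing more.
my $chars_flag = utf8::is_utf8($chars) ? 1 : 0;
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;

# Hypothetical helper: true if the string holds only byte values (0..255)
# and those bytes form well-formed UTF-8. Pure ASCII also counts as true.
sub looks_like_utf8_bytes {
    my ($str) = @_;
    return 0 if $str =~ /[^\x00-\xFF]/;   # wide characters: already decoded
    my $copy = $str;
    utf8::downgrade($copy);               # safe: every character is <= 0xFF
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Encoding a byte string a second time is what "double encoding" means.
my $double = encode('UTF-8', $bytes);

# Decoding once should give plain text; if the result still looks like
# UTF-8 bytes, the data was probably encoded twice.
my $decoded_once      = decode('UTF-8', $double);
my $was_double_encoded = looks_like_utf8_bytes($decoded_once);
```

Note that the helper cannot tell single from double encoding on its own, because both are well-formed UTF-8; it is the "decode once, then check again" step that exposes the second layer.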
      Here is some example code showing where the issue is coming from:

      use LWP::UserAgent;
      use HTTP::Request::Common qw(GET);
      use Encode;

      my $ua = LWP::UserAgent->new;

      # Define user agent type
      $ua->agent('Mozilla/8.0');

      # Request object
      my $req = GET 'http://www.gazeta.ru/culture/2016/04/22/a_8191769.shtml';

      # Make the request
      my $res = $ua->request($req);

      binmode STDOUT, ":utf8";
      print "Content-Type: text/html; charset=utf-8 \n\n";

      if ($res->is_success) {
          my $title;
          $res->decoded_content =~ /<title>(.+?)<\/title>/ and $title = $1;

          # prints correctly here!
          print "GOT TITLE: $title \n";

          $DB->table("ReadingGrabCache")->add( { title => $title, url => "Foo" });
          my $test = $DB->table("ReadingGrabCache")->select ( { url => "Foo" })->fetchrow_hashref;

          # buggered content here
          print "BLA: $test->{title} \n<br>";
      } else {
          print $res->status_line . "\n";
      }


      The DB module is encoding-insensitive (i.e. it's not doing any kind of conversion) ... so I'm confused how it can be fine here, and then broken when grabbed back :(

      Checking it in phpMyAdmin also shows the issue:

      Румыния Π½Π΅ Π±ΡƒΠ΄Π΅Ρ‚ ΡƒΡ‡Π°ΡΡ‚Π²ΠΎΠ²Π°Ρ‚ΡŒ Π² Β«Π•Π²Ρ€ΠΎΠ²ΠΈΠ΄Π΅Π½ΠΈΠΈ-2016Β» ΠΈΠ·-Π·Π° Π΄Π΅Π½Π΅Π³ - Π“Π°Π·Π΅Ρ‚Π°.Ru

      The table is quite simple... but maybe I've missed something:

      CREATE TABLE IF NOT EXISTS `ReadingGrabCache` (
        `grab_id` int(11) NOT NULL AUTO_INCREMENT,
        `url` varchar(255) CHARACTER SET latin1 NOT NULL,
        `images` text CHARACTER SET latin1 NOT NULL,
        `title` text COLLATE utf8_bin NOT NULL,
        `description` text COLLATE utf8_bin NOT NULL,
        `all_images` longtext CHARACTER SET latin1,
        PRIMARY KEY (`grab_id`)
      ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=141 ;
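One possible explanation for why the title prints fine before the insert and garbled after the select, sketched with core modules only. It assumes (this is not confirmed anywhere in the thread) that the DB layer hands back undecoded UTF-8 bytes. With binmode STDOUT, ":utf8" in effect, a character string is encoded once on output, while a string of UTF-8 bytes is treated as Latin-1 characters and encoded a second time. bytes_written() is a hypothetical helper that captures what the output layer would emit.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Print a string through a ':utf8' layer into an in-memory buffer and
# return the bytes that would have reached the browser or terminal.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

# Character string, as decoded_content would return it.
my $chars = "\x{420}\x{443}\x{43c}\x{44b}\x{43d}\x{438}\x{44f}";

# ASSUMPTION: what a byte-oriented DB layer gives back after a round trip.
my $bytes = encode('UTF-8', $chars);

my $good = bytes_written($chars);   # encoded once by the output layer
my $bad  = bytes_written($bytes);   # already bytes, so encoded a second time
```

If that assumption holds, decoding the fetched value before printing (or having the DB handle decode it) would be the fix, rather than encoding before the insert.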
