PerlMonks  

Re^3: UTF8 issue when getting website via LWP::UserAgent in Perl

by ultranerds (Hermit)
on May 12, 2016 at 14:40 UTC ( [id://1162865]=note )


in reply to Re^2: UTF8 issue when getting website via LWP::UserAgent in Perl
in thread UTF8 issue when getting website via LWP::UserAgent in Perl

Oh actually... maybe it didn't :/ It works the first time you print it out (directly), but once it has been saved to the DB it comes back corrupted (even though the DB columns are utf8_bin).

Any other ideas? Who would have thought this would be such a royal PITA! I wish we had all just moved over to one charset ;) Here is what a Data::Dumper dump of the values looks like:
$VAR1 = {
  'title_flag' => 0,
  'og_desc' => "\x{420}\x{443}\x{43c}\x{44b}\x{43d}\x{438}\x{44f} \x{43d}\x{435} \x{441}\x{43c}\x{43e}\x{436}\x{435}\x{442} \x{43f}\x{43e}\x{43a}\x{430}\x{437}\x{44b}\x{432}\x{430}\x{442}\x{44c} \x{442}\x{435}\x{43b}\x{435}\x{432}\x{438}\x{437}\x{438}\x{43e}\x{43d}\x{43d}\x{44b}\x{439} \x{43a}\x{43e}\x{43d}\x{43a}\x{443}\x{440}\x{441} \x{ab}\x{415}\x{432}\x{440}\x{43e}\x{432}\x{438}\x{434}\x{435}\x{43d}\x{438}\x{435}-2016\x{bb}, \x{43f}\x{435}\x{432}\x{435}\x{446} \x{41e}\x{432}\x{438}\x{434}\x{438}\x{443} \x{410}\x{43d}\x{442}\x{43e}\x{43d} \x{43d}\x{435} \x{432}\x{44b}\x{441}\x{442}\x{443}\x{43f}\x{438}\x{442} \x{432} \x{421}\x{442}\x{43e}\x{43a}\x{433}\x{43e}\x{43b}\x{44c}\x{43c}\x{435}, \x{430} \x{440}\x{443}\x{43c}\x{44b}\x{43d}\x{441}\x{43a}\x{438}\x{435} \x{442}\x{435}\x{43b}\x{435}\x{437}\x{440}\x{438}\x{442}\x{435}\x{43b}\x{438} \x{43d}\x{435} \x{441}\x{43c}\x{43e}\x{433}\x{443}\x{442} \x{43f}\x{440}\x{43e}\x{433}\x{43e}\x{43b}\x{43e}\x{441}\x{43e}\x{432}\x{430}\x{442}\x{44c} \x{437}\x{430} \x{43f}\x{43e}\x{43d}\x{440}\x{430}\x{432}\x{438}\x{432}\x{448}\x{438}\x{445}\x{441}\x{44f} \x{43c}\x{443}\x{437}\x{44b}\x{43a}\x{430}\x{43d}\x{442}\x{43e}\x{432} \x{2014} \x{438}\x{437}-\x{437}\x{430} \x{434}\x{43e}\x{43b}\x{433}\x{430} \x{432} 16 \x{43c}\x{43b}\x{43d} \x{448}\x{432}\x{435}\x{439}\x{446}\x{430}\x{440}\x{441}\x{43a}\x{438}\x{445} \x{444}\x{440}\x{430}\x{43d}\x{43a}\x{43e}\x{432}.",
  'title' => " \x{420}\x{443}\x{43c}\x{44b}\x{43d}\x{438}\x{44f} \x{43d}\x{435} \x{431}\x{443}\x{434}\x{435}\x{442} \x{443}\x{447}\x{430}\x{441}\x{442}\x{432}\x{43e}\x{432}\x{430}\x{442}\x{44c} \x{432} \x{ab}\x{415}\x{432}\x{440}\x{43e}\x{432}\x{438}\x{434}\x{435}\x{43d}\x{438}\x{438}-2016\x{bb} \x{438}\x{437}-\x{437}\x{430} \x{434}\x{435}\x{43d}\x{435}\x{433} - \x{413}\x{430}\x{437}\x{435}\x{442}\x{430}.Ru ",
  'charset' => 'windows-1251',
  'og_image' => ' http://img.gazeta.ru/files3/123/8192123/rumin-pic905-895x505-99564.jpg',
  'description' => "\x{420}\x{443}\x{43c}\x{44b}\x{43d}\x{438}\x{44f} \x{43d}\x{435} \x{441}\x{43c}\x{43e}\x{436}\x{435}\x{442} \x{43f}\x{43e}\x{43a}\x{430}\x{437}\x{44b}\x{432}\x{430}\x{442}\x{44c} \x{442}\x{435}\x{43b}\x{435}\x{432}\x{438}\x{437}\x{438}\x{43e}\x{43d}\x{43d}\x{44b}\x{439} \x{43a}\x{43e}\x{43d}\x{43a}\x{443}\x{440}\x{441} \x{ab}\x{415}\x{432}\x{440}\x{43e}\x{432}\x{438}\x{434}\x{435}\x{43d}\x{438}\x{435}-2016\x{bb}, \x{43f}\x{435}\x{432}\x{435}\x{446} \x{41e}\x{432}\x{438}\x{434}\x{438}\x{443} \x{410}\x{43d}\x{442}\x{43e}\x{43d} \x{43d}\x{435} \x{432}\x{44b}\x{441}\x{442}\x{443}\x{43f}\x{438}\x{442} \x{432} \x{421}\x{442}\x{43e}\x{43a}\x{433}\x{43e}\x{43b}\x{44c}\x{43c}\x{435}, \x{430} \x{440}\x{443}\x{43c}\x{44b}\x{43d}\x{441}\x{43a}\x{438}\x{435} \x{442}\x{435}\x{43b}\x{435}\x{437}\x{440}\x{438}\x{442}\x{435}\x{43b}\x{438} \x{43d}\x{435} \x{441}\x{43c}\x{43e}\x{433}\x{443}\x{442} \x{43f}\x{440}\x{43e}\x{433}\x{43e}\x{43b}\x{43e}\x{441}\x{43e}\x{432}\x{430}\x{442}\x{44c} \x{437}\x{430} \x{43f}\x{43e}\x{43d}\x{440}\x{430}\x{432}\x{438}\x{432}\x{448}\x{438}\x{445}\x{441}\x{44f} \x{43c}\x{443}\x{437}\x{44b}\x{43a}\x{430}\x{43d}\x{442}\x{43e}\x{432} \x{2014} \x{438}\x{437}-\x{437}\x{430} \x{434}\x{43e}\x{43b}\x{433}\x{430} \x{432} 16 \x{43c}\x{43b}\x{43d} \x{448}\x{432}\x{435}\x{439}\x{446}\x{430}\x{440}\x{441}\x{43a}\x{438}\x{445} \x{444}\x{440}\x{430}\x{43d}\x{43a}\x{43e}\x{432}."
};
...and here is the outputted JSON:

{"page_title":" Румыния не будет участвовать в «Евровидении-2016» из-за денег - Газета.Ru ","description":"Румыния не сможет показывать телевизионный конкурс «Евровидение-2016», певец Овидиу Антон не выступит в Стокгольме, а румынские телезрители не смогут проголосовать за понравившихся музыкантов — из-за долга в 16 млн швейцарских франков."}

...yet here is what comes back out:
$VAR1 = {
  'images' => '',
  'all_images' => '{"image_loop":["http://static.gazeta.ru/nm2012/i/quotes/finam_head.png","/nm2015//gzt/img/logo.png"," http://img.gazeta.ru/files3/123/8192123/rumin-pic905-895x505-99564.jpg"," http://img.gazeta.ru/files3/885/8195885/igra-pic265-265x150-77294.jpg"," http://img.gazeta.ru/files3/725/7953725/RIAN_02710972.HR.ru-pic410-410x230-99945.jpg"," http://img.gazeta.ru/files3/331/8116331/2016-02-22T104304Z_1519170817_D1AESOMBZKAD_RTRMADP_3_UKRAINE-TATARS-EUROVISION-pic410-410x230-5670.jpg","http://static.smi2.net/srcimg/2780020.png","/nm2015/gzt/img/logo_footer.png","http://static.gazeta.ru/nm2012/i/reuters_a2.png","http://static.gazeta.ru/nm2012/i/prime_a2.png","http://static.gazeta.ru/nm2012/i/interfax_a2.png","http://static.gazeta.ru/nm2012/i/ria_a2.png","http://static.gazeta.ru/nm2012/i/it_a3.png","http://static.gazeta.ru/nm2012/i/lj_a2.png"," http://img.gazeta.ru/files3/123/8192123/rumin-pic905-895x505-99564.jpg"]}',
  'url' => 'www.gazeta.ru/culture/2016/04/22/a_8191769.shtml',
  'title' => ' Румыния Π½Π΅ Π±ΡƒΠ΄Π΅Ρ‚ ΡƒΡ‡Π°ΡΡ‚Π²ΠΎΠ²Π°Ρ‚ΡŒ Π² Β«Π•Π²Ρ€ΠΎΠ²ΠΈΠ΄Π΅Π½ΠΈΠΈ-2016Β» ΠΈΠ·-Π·Π° Π΄Π΅Π½Π΅Π³ - Π“Π°Π·Π΅Ρ‚Π°.Ru ',
  'description' => 'Румыния Π½Π΅ смоТСΡ‚ ΠΏΠΎΠΊΠ°Π·Ρ‹Π²Π°Ρ‚ΡŒ Ρ‚Π΅Π»Π΅Π²ΠΈΠ·ΠΈΠΎΠ½Π½Ρ‹ΠΉ конкурс Β«Π•Π²Ρ€ΠΎΠ²ΠΈΠ΄Π΅Π½ΠΈΠ΅-2016Β», ΠΏΠ΅Π²Π΅Ρ† ΠžΠ²ΠΈΠ΄ΠΈΡƒ Антон Π½Π΅ выступит Π² Π‘Ρ‚ΠΎΠΊΠ³ΠΎΠ»ΡŒΠΌΠ΅, Π° румынскиС Ρ‚Π΅Π»Π΅Π·Ρ€ΠΈΡ‚Π΅Π»ΠΈ Π½Π΅ смогут проголосоваΡ‚ΡŒ Π·Π° ΠΏΠΎΠ½Ρ€Π°Π²ΠΈΠ²ΡˆΠΈΡ…ΡΡ ΠΌΡƒΠ·Ρ‹ΠΊΠ°Π½Ρ‚ΠΎΠ² β€” ΠΈΠ·-Π·Π° Π΄ΠΎΠ»Π³Π° Π² 16 ΠΌΠ»Π½ ΡˆΠ²Π΅ΠΉΡ†Π°Ρ€ΡΠΊΠΈΡ… Ρ„Ρ€Π°Π½ΠΊΠΎΠ².',
  'grab_id' => '133'
};
{"page_title":" Румыния Π½Π΅ Π±ΡƒΠ΄Π΅Ρ‚ ΡƒΡ‡Π°ΡΡ‚Π²ΠΎΠ²Π°Ρ‚ΡŒ Π² Β«Π•Π²Ρ€ΠΎΠ²ΠΈΠ΄Π΅Π½ΠΈΠΈ-2016Β» ΠΈΠ·-Π·Π° Π΄Π΅Π½Π΅Π³ - Π“Π°Π·Π΅Ρ‚Π°.Ru ","description":"Румыния Π½Π΅ смоТСт ΠΏΠΎΠΊΠ°Π·Ρ‹Π²Π°Ρ‚ΡŒ Ρ‚Π΅Π»Π΅Π²ΠΈΠ·ΠΈΠΎΠ½Π½Ρ‹ΠΉ конкурс Β«Π•Π²Ρ€ΠΎΠ²ΠΈΠ΄Π΅Π½ΠΈΠ΅-2016Β», ΠΏΠ΅Π²Π΅Ρ† ΠžΠ²ΠΈΠ΄ΠΈΡƒ Антон Π½Π΅ Π²Ρ‹ступит Π² Π‘Ρ‚ΠΎΠΊΠ³ΠΎΠ»ΡŒΠΌΠ΅, Π° Ρ€умынскиС Ρ‚Π΅Π»Π΅Π·Ρ€ΠΈΡ‚Π΅Π»ΠΈ Π½Π΅ смогут ΠΏΡ€ΠΎΠ³ΠΎΠ»ΠΎΡΠΎΠ²Π°Ρ‚ΡŒ Π·Π° ΠΏΠΎΠ½Ρ€Π°Π²ΠΈΠ²ΡˆΠΈΡ…ΡΡ ΠΌΡƒΠ·Ρ‹ΠΊΠ°Π½Ρ‚ΠΎΠ² β€” ΠΈΠ·-Π·Π° Π΄ΠΎΠ»Π³Π° Π² 16 ΠΌΠ»Π½ Ρˆвейцарских Ρ„Ρ€Π°Π½ΠΊΠΎΠ².","cached":1}


Cheers

Andy

Re^4: UTF8 issue when getting website via LWP::UserAgent in Perl
by Your Mother (Archbishop) on May 12, 2016 at 15:06 UTC

    Your DB handle must understand what it's getting (or else you'll have to do your own encode/decode on the way in and out of it), and your DB column charset must be set to what you think it is. Search for utf-8 in the POD of whichever DBI DBD driver you are using, and make sure the column definition in your table is what you expect/need.

      Hi,

      Mmm... it's almost like it's being double-encoded. Here is an example:

      print "GOT TITLE BEFORE: $title \n<Br><Br>";
      $title = encode("utf-8",$title);
      print "GOT TITLE: $title \n<Br><Br>";
      $DB->table("ReadingGrabCache")->add( { title => $title, url => "Foo" });
      my $test = $DB->table("ReadingGrabCache")->select ( { url => "Foo" })->fetchrow_hashref;
      print "BLA 1: $test->{title} \n<br>";


      ...which gives me:

      GOT TITLE BEFORE: Румыния не будет участвовать в «Евровидении-2016» из-за денег - Газета.Ru

      GOT TITLE: Румыния Π½Π΅ Π±ΡƒΠ΄Π΅Ρ‚ ΡƒΡ‡Π°ΡΡ‚Π²ΠΎΠ²Π°Ρ‚ΡŒ Π² Β«Π•Π²Ρ€ΠΎΠ²ΠΈΠ΄Π΅Π½ΠΈΠΈ-2016Β» ΠΈΠ·-Π·Π° Π΄Π΅Π½Π΅Π³ - Π“Π°Π·Π΅Ρ‚Π°.Ru

      BLA 1: Румыния Π½Π΅ Π±ΡƒΠ΄Π΅Ρ‚ ΡƒΡ‡Π°ΡΡ‚Π²ΠΎΠ²Π°Ρ‚ΡŒ Π² Β«Π•Π²Ρ€ΠΎΠ²ΠΈΠ΄Π΅Π½ΠΈΠΈ-2016Β» ΠΈΠ·-Π·Π° Π΄Π΅Π½Π΅Π³ - Π“Π°Π·Π΅Ρ‚Π°.Ru

      After grabbing it back from the DB, if I do this:

      $test->{title} = decode("utf-8",$title);

      ...then it works! But it seems bonkers to have to do that, when it should be in utf8 already. Is there a way to "check" if a string has the correct utf8 markers?

      Thanks!

      Andy
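On the question of "checking" a string: Perl does not record which encoding a string came from, and `utf8::is_utf8` only reports how Perl stores the string internally, so neither is a reliable marker. A more practical test is to attempt a strict decode. The following is a minimal sketch using only the core Encode module; the helper name `looks_like_utf8_bytes` is made up for illustration and is not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: true if $bytes is a well-formed UTF-8 byte sequence.
sub looks_like_utf8_bytes {
    my ($bytes) = @_;
    # A string holding code points above 255 is not a byte string at all.
    return 0 if $bytes =~ /[^\x00-\xFF]/;
    my $copy = $bytes;    # decode may modify its argument when a CHECK is given
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $chars = "\x{420}\x{443}";           # two characters
my $bytes = encode('UTF-8', $chars);    # the same text as four UTF-8 bytes

printf "chars: length %d\n", length $chars;
printf "bytes: length %d\n", length $bytes;
print  "bytes are valid UTF-8: ", looks_like_utf8_bytes($bytes), "\n";
print  "chars are valid UTF-8 bytes: ", looks_like_utf8_bytes($chars), "\n";
```

Note that a successful strict decode only shows the bytes could be UTF-8; plain ASCII passes too, so this cannot detect text that has already been encoded twice.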

        So, I know it's all confusing. Took me forever. But it's actually really simple. A string of bytes is nothing. It's just binary data. You have to know what it's supposed to be and tell your code so, both when coming from binary and when going back to it. The raw stuff doesn't know (well, some charsets do have BOM flags but it's not something on which you can rely here). Your DBI/DBD driver can do the encode/decode two-step for you automatically as I suggested (it might work even if the table definition is wrong, but it's best to ensure the two agree). :P Examples of the setting to check include:

        • DBD::mysql -> mysql_enable_utf8
          • This attribute determines whether DBD::mysql should assume strings stored in the database are utf8. This feature defaults to off.
        • DBD::SQLite -> sqlite_unicode
          • If the attribute $dbh->{sqlite_unicode} is set, strings coming from the database and passed to the collation function will be properly tagged with the utf8 flag; but this only works if the attribute is set before the first call to a perl collation sequence. The recommended way to activate unicode is to set the sqlite_unicode parameter at connection time.
        • DBD::Pg -> pg_enable_utf8 (integer)
          • DBD::Pg specific attribute. The behavior of DBD::Pg with regards to this flag has changed as of version 3.0.0. The default value for this attribute, -1, indicates that the internal Perl utf8 flag will be turned on for all strings coming back from the database if the client_encoding is set to 'UTF8'. Use of this default is highly encouraged. If your code was previously using pg_enable_utf8, you can probably remove mention of it entirely. :\
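As a rough sketch of where such a flag goes when connecting with plain DBI and DBD::mysql: the attribute is passed in the hash given to `connect`. The helper `connect_utf8`, the DSN, and the credentials below are placeholders, and the `$DB->table(...)` wrapper used elsewhere in this thread may expose connection attributes differently, so treat this as an assumption rather than a drop-in fix.

```perl
use strict;
use warnings;

# Connection attributes; mysql_enable_utf8 is the DBD::mysql flag named above.
my %attr = (
    RaiseError        => 1,
    mysql_enable_utf8 => 1,
);

# Hypothetical helper; DSN, user and password are supplied by the caller.
sub connect_utf8 {
    my ($dsn, $user, $pass) = @_;
    require DBI;    # loaded lazily so the sketch compiles without DBI installed
    return DBI->connect($dsn, $user, $pass, { %attr });
}

# Example call (placeholder values, not from the thread):
# my $dbh = connect_utf8('dbi:mysql:database=test;host=localhost', 'user', 'secret');
```

With the flag set, the driver is expected to hand back decoded character strings, so no manual decode should be needed after a fetch.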

        Update: s/simply/simple/;

        $title = encode("utf-8",$title);

        Why do you do that?
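A small sketch of why that extra `encode` matters: the script already has `binmode STDOUT, ":utf8"`, so the output layer encodes once on its own, and an explicit `encode` before `print` means the text is encoded twice. The example below uses only the core Encode module and a two-character sample string chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $title = "\x{420}\x{443}";            # character string, 2 characters

my $once  = encode('UTF-8', $title);     # 4 bytes: the correct UTF-8 form
my $twice = encode('UTF-8', $once);      # every byte >= 0x80 is encoded again

printf "characters:    %d\n", length $title;
printf "encoded once:  %d bytes\n", length $once;
printf "encoded twice: %d bytes\n", length $twice;

# One decode undoes exactly one encode.
print "one decode recovers the single-encoded bytes\n"
    if decode('UTF-8', $twice) eq $once;
```

Each `decode` undoes one `encode`, which is consistent with a single `decode` making the double-encoded value look right again.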

Re^4: UTF8 issue when getting website via LWP::UserAgent in Perl
by ultranerds (Hermit) on May 12, 2016 at 14:58 UTC
    Here is some example code showing where the issue comes from:

    use LWP::UserAgent;
    use HTTP::Request::Common qw(GET);
    use Encode;

    my $ua = LWP::UserAgent->new;

    # Define user agent type
    $ua->agent('Mozilla/8.0');

    # Request object
    my $req = GET 'http://www.gazeta.ru/culture/2016/04/22/a_8191769.shtml';

    # Make the request
    my $res = $ua->request($req);

    binmode STDOUT, ":utf8";
    print "Content-Type: text/html; charset=utf-8 \n\n";

    if ($res->is_success) {
        my $title;
        $res->decoded_content =~ /<title>(.+?)<\/title>/ and $title = $1;

        # prints correctly here!
        print "GOT TITLE: $title \n";

        $DB->table("ReadingGrabCache")->add( { title => $title, url => "Foo" });
        my $test = $DB->table("ReadingGrabCache")->select ( { url => "Foo" })->fetchrow_hashref;

        # buggered content here
        print "BLA: $test->{title} \n<br>";
    } else {
        print $res->status_line . "\n";
    }


    The DB module is encoding-insensitive (i.e. it's not doing any kind of conversion), so I'm confused how it can be fine here and then broken when grabbed back :(
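One way this can happen, sketched under the assumption that the DB layer hands back raw UTF-8 bytes rather than decoded characters: printing those bytes through a `:utf8` output layer encodes them a second time, while the freshly decoded title goes through the same layer cleanly. The helper `bytes_written` is made up for illustration and writes to an in-memory filehandle instead of STDOUT.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Print a string through a :utf8 layer into memory and return the bytes written.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

my $title   = "\x{420}\x{443}";           # decoded characters, as from decoded_content
my $fetched = encode('UTF-8', $title);    # ASSUMPTION: what comes back from the DB

printf "characters printed: %d bytes\n", length bytes_written($title);
printf "raw bytes printed:  %d bytes\n", length bytes_written($fetched);
printf "decoded first:      %d bytes\n",
    length bytes_written(decode('UTF-8', $fetched));
```

If the fetched value behaves like `$fetched` here, either decoding it after the fetch or enabling the driver's UTF-8 option should bring the output back in line.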

    Checking it in phpMyAdmin also shows the issue:

    Румыния Π½Π΅ Π±ΡƒΠ΄Π΅Ρ‚ ΡƒΡ‡Π°ΡΡ‚Π²ΠΎΠ²Π°Ρ‚ΡŒ Π² Β«Π•Π²Ρ€ΠΎΠ²ΠΈΠ΄Π΅Π½ΠΈΠΈ-2016Β» ΠΈΠ·-Π·Π° Π΄Π΅Π½Π΅Π³ - Π“Π°Π·Π΅Ρ‚Π°.Ru

    The table is quite simple... but maybe I've missed something:

    CREATE TABLE IF NOT EXISTS `ReadingGrabCache` (
      `grab_id` int(11) NOT NULL AUTO_INCREMENT,
      `url` varchar(255) CHARACTER SET latin1 NOT NULL,
      `images` text CHARACTER SET latin1 NOT NULL,
      `title` text COLLATE utf8_bin NOT NULL,
      `description` text COLLATE utf8_bin NOT NULL,
      `all_images` longtext CHARACTER SET latin1,
      PRIMARY KEY (`grab_id`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=141 ;
