PerlMonks  

Re^4: UTF8 issue when getting website via LWP::UserAgent in Perl

by Your Mother (Archbishop)
on May 12, 2016 at 15:06 UTC


in reply to Re^3: UTF8 issue when getting website via LWP::UserAgent in Perl
in thread UTF8 issue when getting website via LWP::UserAgent in Perl

Your DB handle must understand what it's getting (or else you'll have to do your own encode/decode on the way in and out of it), and your DB column charset must be set to what you think it is. Search for utf-8 in the Pod of whichever DBD driver you are using, and make sure the column definition in your table is what you expect/need.

Re^5: UTF8 issue when getting website via LWP::UserAgent in Perl
by ultranerds (Hermit) on May 12, 2016 at 15:13 UTC
    Hi,

    Mmm.. it's almost like it's being double encoded... Here is an example:

    print "GOT TITLE BEFORE: $title \n<Br><Br>";
    $title = encode("utf-8", $title);
    print "GOT TITLE: $title \n<Br><Br>";
    $DB->table("ReadingGrabCache")->add( { title => $title, url => "Foo" } );
    my $test = $DB->table("ReadingGrabCache")->select( { url => "Foo" } )->fetchrow_hashref;
    print "BLA 1: $test->{title} \n<br>";


    ...which gives me:

    GOT TITLE BEFORE: Румыния не будет участвовать в Евровидении-2016 из-за денег - Газета.Ru

    GOT TITLE: ƒм‹ния не бƒде‚ ƒ‡ас‚вова‚Œ в «•в€овидении-2016» из-за денег - “азе‚а.Ru

    BLA 1: ƒм‹ния не бƒде‚ ƒ‡ас‚вова‚Œ в «•в€овидении-2016» из-за денег - “азе‚а.Ru

    After grabbing it back from the DB, if I do this:

    $test->{title} = decode("utf-8",$title);

    ...then it works! But it seems bonkers to have to do that, when it should be in utf8 already. Is there a way to "check" if a string has the correct utf8 markers?

    Thanks!

    Andy

      So, I know it's all confusing. Took me forever. But it's actually really simple. A string of bytes is nothing. It's just binary data. You have to know what encoding it's supposed to be in and tell your code so: decode when text comes in as bytes, encode when it goes back out. The raw stuff doesn't know (well, some encodings do have a BOM, but it's not something you can rely on here). Your DBI/DBD driver can do the encode/decode two-step for you automatically, as I suggested (it might work even if the table definition is wrong, but it's best to make sure the two agree). Examples of the setting to check include:

      • DBD::mysql -> mysql_enable_utf8
        • This attribute determines whether DBD::mysql should assume strings stored in the database are utf8. This feature defaults to off.
      • DBD::SQLite -> sqlite_unicode
        • If the attribute $dbh->{sqlite_unicode} is set, strings coming from the database and passed to the collation function will be properly tagged with the utf8 flag; but this only works if the attribute is set before the first call to a perl collation sequence. The recommended way to activate unicode is to set the sqlite_unicode parameter at connection time.
      • DBD::Pg -> pg_enable_utf8 (integer)
        • DBD::Pg specific attribute. The behavior of DBD::Pg with regards to this flag has changed as of version 3.0.0. The default value for this attribute, -1, indicates that the internal Perl utf8 flag will be turned on for all strings coming back from the database if the client_encoding is set to 'UTF8'. Use of this default is highly encouraged. If your code was previously using pg_enable_utf8, you can probably remove mention of it entirely. :\
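
As a rough illustration of the driver attributes listed above, here is a minimal sketch that sets one of them at connection time and round-trips a title. It uses DBD::SQLite with an in-memory database only so that it runs without a server; the table layout is an assumption made for the example, not the real ReadingGrabCache schema. For DBD::mysql the corresponding attribute in the connect hash would be mysql_enable_utf8 => 1.

```perl
use strict;
use warnings;
use utf8;    # the string literal below is UTF-8 in this source file
use DBI;

# Assumption: DBD::SQLite is installed. The attribute is passed in the
# connect() hash so it is in effect before any statement runs.
my $dbh = DBI->connect(
    "dbi:SQLite:dbname=:memory:", "", "",
    { RaiseError => 1, sqlite_unicode => 1 },
);

# Hypothetical two-column table, just enough for the round trip.
$dbh->do("CREATE TABLE ReadingGrabCache (url TEXT, title TEXT)");

my $title = "Газета.Ru";    # a character string, not bytes
$dbh->do(
    "INSERT INTO ReadingGrabCache (url, title) VALUES (?, ?)",
    undef, "Foo", $title,
);

my ($fetched) = $dbh->selectrow_array(
    "SELECT title FROM ReadingGrabCache WHERE url = ?", undef, "Foo",
);

# Encode only at the output boundary.
binmode STDOUT, ":encoding(UTF-8)";
print $fetched eq $title ? "round trip ok\n" : "round trip mismatch\n";
```

With the attribute set, the driver does the decoding on fetch, so no manual decode() call should be needed on the way out.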

      Update: s/simply/simple/;

        Thanks for the info. Man, this is a PITA :S Think I may have to take a break, and come back to it tomorrow.

        There is definitely something up - because even using basic DBI connection, it still messes it up:

        my $dsn = "DBI:mysql:database=$db_cfg->{database};host=$db_cfg->{host};port=3307";
        my $dbh = DBI->connect($dsn, $db_cfg->{login}, $db_cfg->{password});
        $dbh->{mysql_enable_utf8} = 1;
        my $sth = $dbh->prepare( "INSERT INTO ReadingGrabCache SET title = ?" );
        $sth->execute( $title ) or die $DBI::errstr;
        Eugh :/
      $title = encode("utf-8",$title);

      Why do you do that?

        Sorry, that was a mistake :) This is what I have:
        print "GOT TITLE BEFORE: $title \n<Br><Br>";
        $DB->table("ReadingGrabCache")->add( { title => $title, url => "Foo" } );
        my $test = $DB->table("ReadingGrabCache")->select( { url => "Foo" } )->fetchrow_hashref;
        print "BEFORE RE-ENCODING: $test->{title} \n<br>";
        $test->{title} = decode("utf-8", $test->{title});
        print "AFTER RE-ENCODING: $test->{title} \n<br>";
        GOT TITLE BEFORE: Румыния не будет участвовать в Евровидении-2016 из-за денег - Газета.Ru

        BEFORE RE-ENCODING: (ie right out of the DB grab) ƒм‹ния не бƒде‚ ƒ‡ас‚вова‚Œ в «•в€овидении-2016» из-за денег - “азе‚а.Ru
        AFTER RE-ENCODING: Румыния не будет участвовать в Евровидении-2016 из-за денег - Газета.Ru
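
The before/after output above is what you would expect if the database handle returns raw UTF-8 bytes rather than decoded characters. The sketch below reproduces that distinction with the core Encode module only, and also takes a stab at the earlier "is there a way to check" question. The helper looks_like_utf8_bytes is a hypothetical name made up for this example, and it is a heuristic, not proof: a pure ASCII string passes it either way.

```perl
use strict;
use warnings;
use utf8;    # the string literal below is UTF-8 in this source file
use Encode qw(encode decode);

my $title = "Газета.Ru";                # character string: 9 characters
my $bytes = encode("UTF-8", $title);    # byte string: 15 bytes

# Hypothetical helper: true if the string could be undecoded UTF-8 bytes.
# A string that already holds wide characters makes decode() croak, so it
# returns false for text that has been decoded properly.
sub looks_like_utf8_bytes {
    my ($string) = @_;
    my $copy = $string;    # decode() with a CHECK value may modify its input
    return eval {
        decode("UTF-8", $copy, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    } ? 1 : 0;
}

# What the thread calls "re-encoding" is really the one missing decode step.
my $decoded = decode("UTF-8", $bytes);

binmode STDOUT, ":encoding(UTF-8)";
printf "title:   %d characters\n", length $title;
printf "bytes:   %d bytes\n",      length $bytes;
printf "decoded matches title: %s\n", $decoded eq $title ? "yes" : "no";
printf "bytes look like undecoded UTF-8: %s\n",
    looks_like_utf8_bytes($bytes) ? "yes" : "no";
printf "title looks like undecoded UTF-8: %s\n",
    looks_like_utf8_bytes($title) ? "yes" : "no";
```

Comparing length() before and after the database round trip is often the quickest tell: if the fetched value is longer than what went in, it is probably still bytes.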
