emilioayllon has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

After a whole afternoon trying to find out the problem I seek the wisdom of PerlMonks.

I've got a unicode file where, after processing, some lines will go into a Mysql (4.0.14) DB.

Using perl 5.8.0 I specified that the file is in utf8 by:

$txt_opener = new IO::File;<br> open( $txt_opener,'<:utf8', "$filetxt");
A little bit of processing later the script prints on console the data I want to insert into the DB and all foreign characters seem to be fine.

Once this goes into the DB using a standard SQL command and perl DBI driver the query executes fine.

However, using the console and/or phpmyadmin I interrogate the database and foreign characters ie. ñ, í are corrupted.

when the table is created the character set is specified "CHARACTER SET utf8".

Finally, if in this very same script I set a variable to a fix string with foreign characters ie. $var='ñíó' and execute using the same procedure the whole thing works fine.

So, the question is why if the variable gets assigned to a fixed string the thing works but if it comes from a dynamic assignment it fails to insert the data correctly.

I guess the problem is more related to perl than to mysql but sorry to ask in the wrong forum if this is not the case

Many thanks for your help!

Emilio.

Edit, BazB: added code tags.

Replies are listed 'Best First'.
Re: Perl Mysql and UTF8
by etcshadow (Priest) on Nov 02, 2003 at 02:07 UTC
    Are you binding these character strings? If not, try it. If you are, I'd look into whether or not there is some means of specifying the character encoding of the bind parameter. Perhaps like (ficticious examples follow!):
    $sth->bind_param(1,$value,SQL_VARCHAR_UTF8)
    or perhaps more like
    $sth->bind_param(1,$value,{TYPE=>SQL_VARCHAR, ENCODING=>UTF8})
    I don't know details, but maybe this can guide your search for further information.

    I don't see anything specific about this in the DBI or DBD::mysql docs... but you could try playing around with it a little.


    ------------
    :Wq
    Not an editor command: Wq
      I tried using $var = $dbh->quote($var) but unfortunately this didnt seem to cure the problem.

      Furthermore, if I print out the new $var on console the foreign characters are ruined.

      Thanks for trying.

        Oh! Well, that means that it is $dbh->quote(...) that is hosing your data. I would really recommend that you look into using binds for your dynamic literals rather than quoting them.

        If you are not familiar with binds (or "placeholders", depending on the documentation source), then you should really look into them. The basic idea is that, instead of doing this:

        my $quotedvar = $dbh->quote($var); my $sql = "insert into mytable (mycolumn) values ($quotedvar)"; $dbh->do($sql);
        do this:
        my $sql = "insert into mytable (mycolumn) values (?)"; $dbh->do($sql,undef,$var);
        But you should definiftely consult the DBI docs for more info on this (see the section under "Placeholder and Bind Values").

        I'm really not at all surprised that $dbh->quote(...) is mangling your data, because, well, that's what its job is, and apparently it is just getting a little bit over-zealous with its mangling. More to the point, it may not be character-encoding-aware... wheras simply binding the value just passes it directly through to the DBD. And, in the case of mysql, the DBD is probably capable of handling character-encoding well, since there is explicit functinoality around it in the database.


        ------------
        :Wq
        Not an editor command: Wq
Re: Perl Mysql and UTF8
by graff (Chancellor) on Nov 02, 2003 at 15:17 UTC
    I was not able to replicate your results. I loaded utf8 text data from a file into a table and got it back unharmed, using both the mysql command line utility and a perl DBI script.

    The one difference between my Perl/DBI attempt and yours might be that I prepared the insert (or update) statement with a placeholder where the text value would go, rather than trying to build the statement with the actual text value in place and trying to quote it somehow -- that is:

    my $db = DBI->connect( blah, blah ); my $sql = $db->prepare( "insert into my_table (col1,col2) values (?,?) +" ); while (<>) { # assume we are reading rows of data... chomp; my ( $v1, $v2 ) = split( /\t/ ); # ... tab-delimited $sql->execute( $v1, $v2 ); } $sql->finish; # now try reading stuff back: $sql = $db->prepare( "select col1,col2 from my_table" ); $sql->execute; # (update: forgot to put this in at first) my $rowref = $sql->fetchall_arrayref; for my $row ( @$rowref ) { print join( "\t", @$row ), $/; } $sql->finish;
    As a strange little aside: you may need to be careful in your perl script about setting the character semantics on whatever file handles you use for input and output. If both are supposed to be utf8, then I suggest being explicit about that in your perl code (e.g. include  binmode(STDOUT,":utf8"); if you're writing stuff to STDOUT).

    As for controlling character semantics in the database transactions, in general I'd say don't -- data is data as far as the RDBMS is concerned (mysql or other), and whatever byte sequence you put in, that's what you'll get back, so long as you use DBI's placeholder / parameter syntax for putting data values into the sql statements.