cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Howdy Monks. I am having a fun Saturday afternoon trying to get Perl to play nice with MySQL to move Arabic text between two schemas. Here is the relevant code chunk (source db has already been connected, etc):
my $sqlstatement = 'select name_arabic from data'; my $sth = $dbh->prepare($sqlstatement); my @row; my @cases; $sth->execute || die "Could not execute MySQL statement: $sqlstatement +"; while (@row=$sth->fetchrow_array) { push(@cases, [ @row ]); } $sth->finish(); for my $i (0..$#cases) { my $sqlstatement = 'insert into newdb.names (arabicname) values (? +)'; my $sth = $dbh->prepare($sqlstatement); $sth->execute($cases[$i][0]) || die "putsql could not execute MySQ +L statement: $sqlstatement"; $sth->finish(); }
In both the existing source table and the newdb table, the column charset is set to cp1256 (which is windoze arabic). When I look at the data table in the source db with a viewer that will show that charset, it shows the arabic chars.

After running the above code, when I look at newdb.names with the same viewer, it shows question marks substituted for the arabic chars.

So it would seem that in the course of reading and/or writing the data I am hosing up the char encoding somehow. Anyone know how I can straighten it out?

Thanks...Steve

Replies are listed 'Best First'.
Re: Trouble with Perl MySQL Arabic
by graff (Chancellor) on Aug 20, 2006 at 05:24 UTC
    The question marks indicate that something has tried (or is trying) to treat the cp1256 characters as utf8, and failing. That is, a string of cp1256 bytes has been (is being) given to some tool or function that expects some form of unicode (probably utf8), and since the cp1256 string cannot be mistaken for valid utf8 multi-byte ("wide") characters, the result is a string of "????", to reflect that the byte sequence is not parsable as utf8.

    The code you've shown gives no evidence that this sort of misinterpretation would happen. So the problem involves something you haven't told us about yet.

    Is there anything else different between the two data versions besides the schema? (Different database, different server, different machine, different OS?) When you say "with the same viewer", what is that, and is it really the same executable on the same machine showing both versions of the data?

    (Update: By any chance, is your perl script running on Red Hat 9 with perl 5.8.0? If so, check your locale setting -- this might be "coercing" the data to utf8 for you "by default" in some way.)

    Have you tried using the "mysqlimport" tool, or the "LOAD DATA INFILE" operation, instead of doing inserts with DBI? Dumping the names from the old schema to a plain-text (cp1256) file, and then using that text file with mysqlimport, would at least be quicker than doing a series of DBI inserts (maybe not noticeable over a small data set, but if you have more than a few thousand rows, you'll see a speed difference).

    But apart from efficiency, you also have the ability to check the data coming out of the old schema, and if you dump the data from the new schema the same way to a distinct text file, you'll be able to tell right away whether the load/retrieval on the new table is the problem (as opposed to the display of the data), just by comparing the two plain-text files.

    (update: fixed a mis-spelling of "mysqlimport")

      i agree with graff, that it's probably a problem that ocurrs while recoding data between charsets. It might fail at either of the two points: reading or writing.
      You should try to determine which charset the data ist you get from your dbi call. It might be correctly converted to utf8, or erroneously be treated as utf8 even though it isn't.
      If it is converted to utf8 correctly, then an explicit recode might be required for writing back the data (utf8 to cp1256).
      If it reads the input data as correct unicode data, it might also be that the viewer you are using is only showing questionmarks, because you inserted unicode data and the viewer can't handle it.
      Check the codepage of the data you are getting and the codepage you want to write it back with. Judging by your problem it seems to be required that you recode your data at least once, and for that you'll have to know the input codepage and the desired output codepage.
      Is there anything else different between the two data versions besides the schema?
      Nope, same machine same, MySQL server.
      When you say "with the same viewer", what is that
      Navicat MySQL. Same when I export the files to Excel and look at them with that.
      is your perl script running on Red Hat 9 with perl 5.8.0
      Running windoze with 5.8.4
      Have you tried using the "msqlimport" tool
      It's a good suggestion, and I may try it if I get desperate. I'm trying to do this with Perl bc this is a friend's db that's in MS Access an is not normlized. So I'm trying to both get it into MySQL and rearrange the tables. Navicat MySQL does the import from MS Access to MySQL fine so now I'm trying to move from the schema that was imported from Access to a new refactored schema.
        I don't think mysqlimport should be left as a "last desperate measure", and I don't see why your situation would disfavor it, so my advice would be to try that next. Any other cp1256-based tool you have that can do Arabic display should be able to show the plain-text file that you dump from the old table, to confirm that the data is okay up to that point. Then, load to and dump from the new table, and the data should be basically the same (especially if you sort it the same way from both versions).

        If the data in the "fixed" table still looks buggy, you might try the following alterations to the OP code, just to see if this makes a difference (I don't know whether it will):

        use Encode; # add this near the top my $sqlstatement = 'select name_arabic from data'; my $sth = $dbh->prepare($sqlstatement); $sth->execute || die "Could not execute MySQL statement: $sqlstatement +"; my @row; my @cases; # if handling only one field, let's make this a plain array while (@row=$sth->fetchrow_array) { # explicitly convert to utf8 (die +on failure) push @cases, decode( 'cp1256', $row[0], Encode::FB_CROAK ); } $sth->finish(); # optimize a little: the insert statement only needs to be prepared on +ce $sqlstatement = 'insert into newdb.names (arabicname) values (?)'; $sth = $dbh->prepare($sqlstatement); for my $i (0..$#cases) { # now explicitly convert back to cp1256 $sth->execute( encode( 'cp1256', $cases[$i] )) or die "putsql could not execute MySQL statement: $sqlstateme +nt"; } $sth->finish();
        (updated to remove redundant "my" declarations on $sqlstatement and $sth)
Re: Trouble with Perl MySQL Arabic
by blockcipher (Beadle) on Aug 21, 2006 at 14:15 UTC

    Just a quick little piece of advice: check your fonts. Not all fonts display the same characters. Your font may not support arabic characters, which would cause you to not see the correct text. If you're working in a GUI, Arial-Unicode works pretty well.

    One way to test, but maybe not the best way, is to fire up a text editor that supports using non-latin1 fonts, set the font to what you'd like to try, and load up some arabic text.