in reply to Trouble with Perl MySQL Arabic

The question marks indicate that something has tried (or is trying) to treat the cp1256 characters as utf8, and failing. That is, a string of cp1256 bytes has been (is being) given to some tool or function that expects some form of unicode (probably utf8), and since the cp1256 string cannot be mistaken for valid utf8 multi-byte ("wide") characters, the result is a string of "????", to reflect that the byte sequence is not parsable as utf8.
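This failure mode can be seen in isolation with the Encode module. A minimal sketch, independent of DBI; the byte values are illustrative Arabic-range cp1256 bytes, not data from the actual table:

```perl
use strict;
use warnings;
use Encode qw(decode);

# A few bytes that are valid cp1256 (Arabic letters) but not valid UTF-8:
my $cp1256_bytes = "\xC7\xE1\xDA\xD1\xC8\xED\xC9";

# Strict UTF-8 decoding rejects the byte sequence outright...
my $probe = $cp1256_bytes;
my $ok = eval { decode('UTF-8', $probe, Encode::FB_CROAK); 1 };
print $ok ? "decoded as utf8\n" : "not valid utf8\n";

# ...while decoding with the correct charset succeeds:
my $text = decode('cp1256', $cp1256_bytes);
print length($text), " characters\n";
```

Whatever mangles the data into "????" is doing something like the first decode, but substituting a replacement character instead of dying.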

The code you've shown gives no evidence that this sort of misinterpretation would happen. So the problem involves something you haven't told us about yet.

Is there anything else different between the two data versions besides the schema? (Different database, different server, different machine, different OS?) When you say "with the same viewer", what is that, and is it really the same executable on the same machine showing both versions of the data?

(Update: By any chance, is your perl script running on Red Hat 9 with perl 5.8.0? If so, check your locale setting -- this might be "coercing" the data to utf8 for you "by default" in some way.)

Have you tried using the "mysqlimport" tool, or the "LOAD DATA INFILE" operation, instead of doing inserts with DBI? Dumping the names from the old schema to a plain-text (cp1256) file, and then using that text file with mysqlimport, would at least be quicker than doing a series of DBI inserts (maybe not noticeable over a small data set, but if you have more than a few thousand rows, you'll see a speed difference).
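The LOAD DATA INFILE route can also be driven from the same DBI handle. A sketch only: the input file name is hypothetical, the table and column names are taken from the insert code later in this thread, and an actual run needs a connected $dbh:

```perl
use strict;
use warnings;

# Hypothetical plain-text (cp1256) dump of the names, one per line:
my $infile = 'names_cp1256.txt';

# Build the statement; with a live handle you would run $dbh->do($sql).
my $sql = "LOAD DATA LOCAL INFILE '$infile' "
        . "INTO TABLE newdb.names (arabicname)";
print "$sql\n";
```

The LOCAL keyword makes the server read the file from the client side, which matters when the script and the server are on different machines.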

But apart from efficiency, you also have the ability to check the data coming out of the old schema, and if you dump the data from the new schema the same way to a distinct text file, you'll be able to tell right away whether the load/retrieval on the new table is the problem (as opposed to the display of the data), just by comparing the two plain-text files.
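Comparing the two dumps can be a few lines of throwaway Perl. This sketch stands in temporary files with made-up contents for the real old- and new-schema dumps, and reports the first line where the raw bytes differ:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-ins for the two dump files (old schema vs. new schema):
my ($fh_old, $old_file) = tempfile();
print {$fh_old} "\xC7\xE1\n";   # cp1256 bytes from the old table
close $fh_old;
my ($fh_new, $new_file) = tempfile();
print {$fh_new} "??\n";         # what a mangled new-table dump might hold
close $fh_new;

# Compare byte-for-byte, remembering the first mismatching line:
open my $in_old, '<:raw', $old_file or die $!;
open my $in_new, '<:raw', $new_file or die $!;
my $first_diff;
my $line = 0;
while (1) {
    my $x = <$in_old>;
    my $y = <$in_new>;
    last if !defined($x) && !defined($y);
    $line++;
    if (($x // '') ne ($y // '')) {
        $first_diff = $line;
        last;
    }
}
close $in_old;
close $in_new;
print defined $first_diff ? "dumps differ at line $first_diff\n"
                          : "dumps match\n";
```

Reading with `:raw` keeps Perl from applying any layer-level recoding, so the comparison reflects exactly what is in the files.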

(update: fixed a mis-spelling of "mysqlimport")

Re^2: Trouble with Perl MySQL Arabic
by rminner (Chaplain) on Aug 20, 2006 at 09:03 UTC
    I agree with graff that it's probably a problem that occurs while recoding data between charsets. It might fail at either of two points: reading or writing.
    You should try to determine which charset the data you get from your DBI call is in. It might be correctly converted to utf8, or erroneously treated as utf8 even though it isn't.
    If it is converted to utf8 correctly, then an explicit recode might be required for writing back the data (utf8 to cp1256).
    If it reads the input data as correct unicode data, it might also be that the viewer you are using is only showing question marks because you inserted unicode data and the viewer can't handle it.
    Check the codepage of the data you are getting and the codepage you want to write it back with. Judging by your problem, it seems you need to recode your data at least once, and for that you'll have to know the input codepage and the desired output codepage.
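The check-and-recode step above can be done directly with the Encode module: decode the bytes with the known input codepage, confirm you now hold character data, and encode back for the output codepage. A minimal round-trip sketch (byte values are arbitrary cp1256 Arabic bytes):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $cp1256_bytes = "\xC7\xE1\xC8";            # a few cp1256 Arabic bytes
my $chars = decode('cp1256', $cp1256_bytes);  # byte string -> character string
print Encode::is_utf8($chars) ? "character data\n" : "byte data\n";

my $back = encode('cp1256', $chars);          # recode for a cp1256 consumer
print $back eq $cp1256_bytes ? "round trip ok\n" : "round trip lost data\n";
```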
Re^2: Trouble with Perl MySQL Arabic
by cormanaz (Deacon) on Aug 20, 2006 at 21:22 UTC
    Is there anything else different between the two data versions besides the schema?
    Nope, same machine, same MySQL server.
    When you say "with the same viewer", what is that
    Navicat MySQL. Same when I export the files to Excel and look at them with that.
    is your perl script running on Red Hat 9 with perl 5.8.0
    Running windoze with 5.8.4
    Have you tried using the "mysqlimport" tool
    It's a good suggestion, and I may try it if I get desperate. I'm trying to do this with Perl because this is a friend's db that's in MS Access and is not normalized. So I'm trying to both get it into MySQL and rearrange the tables. Navicat MySQL does the import from MS Access to MySQL fine, so now I'm trying to move from the schema that was imported from Access to a new refactored schema.
      I don't think mysqlimport should be left as a "last desperate measure", and I don't see why your situation would disfavor it, so my advice would be to try that next. Any other cp1256-based tool you have that can do Arabic display should be able to show the plain-text file that you dump from the old table, to confirm that the data is okay up to that point. Then, load to and dump from the new table, and the data should be basically the same (especially if you sort it the same way from both versions).

      If the data in the "fixed" table still looks buggy, you might try the following alterations to the OP code, just to see if this makes a difference (I don't know whether it will):

      use Encode; # add this near the top

      my $sqlstatement = 'select name_arabic from data';
      my $sth = $dbh->prepare($sqlstatement);
      $sth->execute || die "Could not execute MySQL statement: $sqlstatement";
      my @row;
      my @cases; # if handling only one field, let's make this a plain array
      while (@row = $sth->fetchrow_array) {
          # explicitly convert to utf8 (die on failure)
          push @cases, decode( 'cp1256', $row[0], Encode::FB_CROAK );
      }
      $sth->finish();

      # optimize a little: the insert statement only needs to be prepared once
      $sqlstatement = 'insert into newdb.names (arabicname) values (?)';
      $sth = $dbh->prepare($sqlstatement);
      for my $i (0..$#cases) {
          # now explicitly convert back to cp1256
          $sth->execute( encode( 'cp1256', $cases[$i] ))
              or die "putsql could not execute MySQL statement: $sqlstatement";
      }
      $sth->finish();
      (updated to remove redundant "my" declarations on $sqlstatement and $sth)