Kanishka has asked for the wisdom of the Perl Monks concerning the following question:

Venerable sirs!

The result written into the database table by this script below is unreadable. Eg: "Asistente de Dirección" = "Asistente de Dirección"

But if i write the resluts into a text file it is readable. Where have i gone wrong? (Ignore the functions load and Adump)

#!/usr/bin/perl -w use strict; use LWP::UserAgent; use HTTP::Cookies; #use lib '/Perl/site/lib/myModules';s use myModules::initHTTP qw(:all); use DBI; initHttp('http://water:8080'); ###################################################################### +#### my ($dbh) = DBI->connect('DBI:ODBC:xxxx', 'xcx', '') or die "Couldn't +connect to database: " . DBI->errstr; ###################################################################### +#### getAll(); ###################################################################### +#### $dbh->disconnect(); ###################################################################### +#### sub getAll{ print "Fetching All records\n"; my $rs = sendRequest('GET', 'http://www.mycareer.loreal.com/candid +ate/searchjobresults_xml.asp?l=15'); #fetch page while ($rs =~ m!vacancytitle="(.*?)".*?country="(.*?)" startdate=" +(.*?)".*?employmentfield="(.*?)".*?/>!isg){ print $1."\t".$2."\t".$3."\t".$4."\n"; insert($1, $2, $3, $4); } #Adump($rs);exit; } sub insert{ my ($vacancytitle, $country, $startdate, $employmentfield) = @_; my $sth2 = $dbh->prepare('insert into xxxx..Loreal (vacancytitle, +country, startdate, employmentfield) values (?, ?, ?, ?)') or die "Co +uldn't prepare UPDATE statement: " . $dbh->errstr; # this line is onl +y for the Itchy server $sth2->execute ($vacancytitle, $country, $startdate, $employmentfi +eld) or die "Couldn't execute INSERT statement: " . $sth2->errstr; $sth2->finish; } sub load { $/ = undef; open OUT, 'optList.asp'; my $p = <OUT>; close OUT; return $p; } sub Adump { my $lst = shift; open OUT, '>Loreal.xml'; print OUT $lst; close OUT; }

Formatting added by GrandFather

2006-10-16 Retitled by GrandFather, as per Monastery guidelines
Original title: 'The result is unreadable'

  • Comment on Fine when written to text file, but unreadable when written to database table
  • Download Code

Replies are listed 'Best First'.
Re: Fine when written to text file, but unreadable when written to database table
by shmem (Chancellor) on Oct 16, 2006 at 06:33 UTC
    It seems that you expect Latin1, but your output is utf-8. Read Encode.
    perl -MEncode=from_to -le '$_="Dirección";from_to($_,"utf-8","iso-885 +9-1");print' Dirección

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Fine when written to text file, but unreadable when written to database table
by graff (Chancellor) on Oct 16, 2006 at 07:12 UTC
    It looks like the "ó" character is probably being stored in the database as a two-byte utf8 "wide" character. It turns out that when ó is encoded in utf8, the two-byte sequence is "\xC3\xB3"; you can look those up in a Latin1 chart and see that if these two bytes were treated as separate characters, they would come out as "Ã" and "superscript 3".

    So somewhere in your setup, you are storing strings in utf8, and then somewhere else, you are treating them as if they were not utf8, but rather some single-byte encoding such as iso-8859-1 or cp1252.

    You haven't given us enough information to tell where the problem is. Maybe the utf8 string is contained in the web page that you fetch, and is being stored in the database as the two-byte sequence. When you read that back from the database, the two-byte utf8 character might be getting displayed "as-is" on a 8859-1 or cp1252 display, or it could be that the two bytes are each being "upgraded" to utf8 characters and you're seeing à and superscript-3 on a utf8 display.

    Whatever the problem, you just need to be explicit about what encoding is being used at each step of your process, and maybe do some encoding "conversions" at appropriate points.

    If the database contains utf8 strings, and you use a Perl script to read stuff back from the database, Perl probably won't be able to know automatically that the string contains utf8 "wide" characters, and you'll need to use Encode to make that explicit:

    use Encode; # assume the $string contains a value fetched from the database: $string = decode( "utf8", $string ); # sets the "utf8-flag" on $strin +g;
    If that doesn't help, and you can't figure out what really needs to be done, you'll need to give us more information: What OS are you using, and are you using a utf8-based locale? What are you using to view the text data? Can you confirm whether the string is being stored in the database as utf8?
      I use windows 2000 as my platform.

      I use MSSQL.

      THe collation used there is 'SQL_Latin1_General_CP850_CI_AI'

      The downloaded file is an XML file. It uses 'encoding="UTF-8'
        In that case, you should either convert the strings from utf8 to cp850 before you store them to the database, or else you should convert them after fetching them back from the database, before you print them to a file or display them. (See the "from_to" function in Encode.)

        Whatever you do, make sure the database content always has the same encoding for all text data. Mixing different encodings into a single database would be as bad as mixing them in a single paragraph -- it becomes impossible (or at least terribly difficult) to make the data coherent.

        (It is possible to have a table with different fields using different encodings; you could even have pairs of fields, like "name" and "name_encoding" so that the encoding of "name" is specified for each row, but that's more trouble than it's worth. Keep it simple.)

Re: Fine when written to text file, but unreadable when written to database table
by tilly (Archbishop) on Oct 16, 2006 at 04:58 UTC
    Please be a little clearer about what you mean by "unreadable".

    That said, it looks like you're missing a $dbh->commit; line, so you likely are inserting data only to see it rolled back automatically when you disconnect.

      Eg: "Asistente de Dirección" shows like "Asistente de Dirección".
      BTW the DBH part works fine!