substitution regex and unicode

frasco has asked for the wisdom of the Perl Monks concerning the following question:

Maybe I'm tired, but I really can't solve this one! I have the following script, which retrieves data from MySQL:

#!/usr/bin/perl -w

use strict;
use warnings;
use DBI;
use CGI qw(:standard);
use CGI::Carp qw(fatalsToBrowser);
use utf8;
use Encode qw(decode);
binmode(STDOUT, ":encoding(utf8)");

my ($datasource, $user, $passw, $dbh, $sth);
my ($id_testo, $indice, $parole, $posizione);
my (@row, $field);

$datasource   = "DBI:mysql:database=Test;host=xxxxxxx;";
$user  = "xxxxxxxx";
$passw = "";
$dbh = DBI->connect($datasource, $user, $passw) || die "Error opening db: $DBI::errstr\n";

$dbh->do("SET NAMES 'utf8'");

$sth = $dbh->prepare("SELECT indice, GROUP_CONCAT(parole SEPARATOR ' ') FROM testo GROUP BY indice");
$sth->execute();

print header(-type => "text/html", -charset => "utf-8"),
    start_html(-encoding => 'utf-8', "My_database"), "\n",
    h2("ARET 1.1"), "\n";

while (@row = $sth->fetchrow_array) {
    for $field(@row) {
    
    $field =~ s/([à]+)/<i>$1<\/i>/g;          # lower --> italic
    $field =~ s/(\p{Lu}+)/lc($1)/ge;          # upper --> lower
    $field =~ s/-=(.{1,4})/<sup>$1<\/sup>/g;  # OK
    
    }
    
print p(),        
    decode("utf8", "$row[0]\t$row[1]\n");
}

$sth->finish();
$dbh->disconnect() || die "fallita disconnessione\n";

It gives me back the following output:

.. various html tag and meta ..
r.1,1	1 ʾà-da-um-=TUG2-II 1 AKTUM-=TÚG 1 IB2-IV-=TÚG SA₆ DAR
r.1,2	g_*NI-ra-ar-=KI
r.1,3	2 ʾa3-da-um-=TÚG-II 1 ʾa3-da-um-=TÚG-I

Well, I really don't understand why the substitution regexes don't work with Unicode characters such as the accented vowel à (or Ú, or even the sign ʾ). I tried replacing the à with an escape such as \x{2be}, but nothing changed. My data come from a MySQL table that uses the utf8 charset. Is it possible that my data have not yet been decoded when they reach the for loop and the substitution regexes? Thank you.

Re: substitution regex and unicode by Joost (Canon) on May 02, 2008 at 21:30 UTC
DBD::mysql does not provide unicode strings by default. You need to use version 4.004 or higher (earlier versions have serious unicode bugs) and set the mysql_enable_utf8 option. See also A UTF8 round trip with MySQL (and take note of the replies there).
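To see why the regexes fail on undecoded data, here is a minimal sketch that needs no database. It assumes the driver hands back raw UTF-8 bytes when mysql_enable_utf8 is not set, and simulates that with Encode; the helper `italicize` and the sample field are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;
use utf8;                          # literals in this file are character strings
use Encode qw(encode decode);

# Stand-in (assumption) for a field fetched without mysql_enable_utf8:
# raw UTF-8 bytes rather than decoded characters.
my $bytes = encode('UTF-8', 'ʾà-da-um');

# The same field once decoded into Perl characters.
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper: italicize runs of "à", as the original script does.
sub italicize {
    my ($text) = @_;
    (my $copy = $text) =~ s/([à]+)/<i>$1<\/i>/g;
    return $copy;
}

my $from_bytes = italicize($bytes);   # no match: "à" is two separate bytes here
my $from_chars = italicize($chars);   # match: "à" is a single character

binmode(STDOUT, ':encoding(UTF-8)');
print "bytes changed: ", ($from_bytes ne $bytes ? "yes" : "no"), "\n";
print "chars changed: ", ($from_chars ne $chars ? "yes" : "no"), "\n";
```

The byte string is 10 elements long and the decoded one 8, which is the whole problem: the pattern looks for one character where the undecoded data holds two bytes.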
Re^2: substitution regex and unicode by frasco (Beadle) on May 03, 2008 at 10:09 UTC
Thank you, Joost. I understand my mistake now (and that alone is a great result)! When retrieving data from MySQL, I didn't tell it to use {mysql_enable_utf8 => 1}: `$dbh = DBI->connect($datasource, $user, $passw, {mysql_enable_utf8 => 1})` If I understand correctly, Perl now has everything it needs to work with Unicode strings and, consequently, with regexes. So I must delete the decode("utf8", ...) call at the very end of my script and leave the statements that are printed out as they are. Thank you again for pointing me to that link.
Re^3: substitution regex and unicode by ikegami (Patriarch) on May 03, 2008 at 10:45 UTC
You should be encoding what you print. The system file handle and HTTP can only deal with bytes, which means the characters must be converted from Perl's internal string format (as returned by `mysql_enable_utf8 => 1`) into bytes by encoding them. `print encode("UTF-8", "$row[0]\t$row[1]\n");` or `binmode(STDOUT, ':encoding(UTF-8)'); print "$row[0]\t$row[1]\n";`
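As a rough illustration that the two options produce the same bytes, the sketch below writes one line to two in-memory file handles, once with an explicit encode and once through an encoding layer. The sample line and the variable names are invented for the example; a real script would print to STDOUT instead.

```perl
use strict;
use warnings;
use utf8;                          # literals in this file are character strings
use Encode qw(encode);

# A character string, standing in for a row fetched with mysql_enable_utf8.
my $line = "r.1,1\tAKTUM-=TÚG\n";

# Option 1: encode explicitly, then print the bytes to a plain handle.
my $out1 = '';
open(my $fh1, '>', \$out1) or die "open failed: $!";
print {$fh1} encode('UTF-8', $line);
close($fh1);

# Option 2: put an encoding layer on the handle and print characters.
my $out2 = '';
open(my $fh2, '>:encoding(UTF-8)', \$out2) or die "open failed: $!";
print {$fh2} $line;
close($fh2);

print "same bytes\n" if $out1 eq $out2;
```

Use one option or the other, not both: calling encode and then printing through an encoding layer would encode the data twice.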
Re^4: substitution regex and unicode by frasco (Beadle) on May 07, 2008 at 18:40 UTC
Re^5: substitution regex and unicode by ikegami (Patriarch) on May 07, 2008 at 22:41 UTC