frasco has asked for the wisdom of the Perl Monks concerning the following question:

Maybe I'm tired, but I really can't figure this one out! I have the following script, which retrieves data from MySQL:
#!/usr/bin/perl -w
use strict;
use warnings;
use DBI;
use CGI qw(:standard);
use CGI::Carp qw(fatalsToBrowser);
use utf8;
use Encode qw(decode decode);
binmode(STDOUT, ":encoding(utf8)");

my ($datasource, $user, $passw, $dbh, $sth);
my ($id_testo, $indice, $parole, $posizione);
my (@row, $field);

$datasource = "DBI:mysql:database=Test;host=xxxxxxx;";
$user = "xxxxxxxx";
$passw = "";
$dbh = DBI->connect($datasource, $user, $passw) || die "Error opening db: $DBI::errstr\n";
$dbh->do("SET NAMES 'utf8'");
$sth = $dbh->prepare("SELECT indice, GROUP_CONCAT(parole SEPARATOR ' ') FROM testo GROUP BY indice");
$sth->execute();

print header(-type => "text/html", -charset => "utf-8"),
      start_html(-encoding => 'utf-8', "My_database"), "\n",
      h2("ARET 1.1"), "\n";

while (@row = $sth->fetchrow_array) {
    for $field (@row) {
        $field =~ s/([à]+)/<i>$1<\/i>/g;          # lower --> italic
        $field =~ s/(\p{Lu}+)/lc($1)/ge;          # upper --> lower
        $field =~ s/-=(.{1,4})/<sup>$1<\/sup>/g;  # OK
    }
    print p(), decode("utf8", "$row[0]\t$row[1]\n");
}
$sth->finish();
$dbh->disconnect() || die "fallita disconnessione\n";
It gives me back the following output:
.. various html tag and meta ..

r.1,1 1 ʾà-da-um-=TUG2-II 1 AKTUM-=TÚG 1 IB2-IV-=TÚG SA₆ DAR

r.1,2 g_*NI-ra-ar-=KI

r.1,3 2 ʾa3-da-um-=TÚG-II 1 ʾa3-da-um-=TÚG-I

Well, I really don't understand why the substitution regexes don't work with Unicode characters such as the accented vowel à (or Ú, or even the sign ʾ). I tried replacing the literal character with a code point escape (\x{2be}), but nothing changes. My data come from a MySQL table that uses the utf8 charset. Is it possible that my data have not yet been decoded when they reach the for loop and the substitution regexes? Thank you

Re: substitution regex and unicode
by Joost (Canon) on May 02, 2008 at 21:30 UTC
      Thank you, Joost. I understand my mistake now (and that alone is a great achievement)! When retrieving data from MySQL, I didn't tell DBI to use {mysql_enable_utf8 => 1}:
      $dbh = DBI->connect($datasource, $user, $passw, {mysql_enable_utf8 => 1});
      If I understand correctly, Perl now already has everything it needs to work with Unicode strings and, consequently, with regexes. So I must delete the decode("utf8", ...) call at the very end of my script and leave the statements to be printed as they are. Thank you again for pointing me to that link.
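To make the difference concrete, here is a minimal sketch that needs no database. It assumes, as described above, that without mysql_enable_utf8 the driver returns undecoded UTF-8 bytes. The helper name mark_up and the sample value are made up for illustration, and the characters are written as \x{...} escapes so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper wrapping the three substitutions from the script above.
# \x{e0} is "à".
sub mark_up {
    my ($field) = @_;
    $field =~ s/([\x{e0}]+)/<i>$1<\/i>/g;       # à      --> italic
    $field =~ s/(\p{Lu}+)/lc($1)/ge;            # upper  --> lower
    $field =~ s/-=(.{1,4})/<sup>$1<\/sup>/g;    # -=xxxx --> superscript
    return $field;
}

# Sample value modelled on the output above: ʾà-da-um-=TÚG-II
my $chars = "\x{2be}\x{e0}-da-um-=T\x{da}G-II";

# Assumption: this is what arrives from the driver when the data
# have not been decoded, i.e. the same text as raw UTF-8 bytes.
my $bytes = encode('UTF-8', $chars);

my $from_bytes = mark_up($bytes);                   # regexes see bytes
my $from_chars = mark_up(decode('UTF-8', $bytes));  # regexes see characters

binmode(STDOUT, ':encoding(UTF-8)');
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
print  "italic applied on bytes?      ", ($from_bytes =~ /<i>/ ? "yes" : "no"), "\n";
print  "italic applied on characters? ", ($from_chars =~ /<i>/ ? "yes" : "no"), "\n";
print  "decoded first: $from_chars\n";
```

On the byte string, "à" is two separate bytes, neither of which is code point E0, so the first substitution never matches. Once the data are decoded, all three substitutions see whole characters.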

        You should be *encoding* what you print. The system file handle and HTTP can only deal with bytes, which means the characters must be converted from Perl's internal string format (as returned with mysql_enable_utf8 => 1) into bytes by encoding them.

        print encode("UTF-8", "$row[0]\t$row[1]\n");
        or
        binmode(STDOUT, ':encoding(UTF-8)');
        print "$row[0]\t$row[1]\n";
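
The two options are alternatives, not steps to combine. Here is a small sketch that writes to in-memory file handles instead of STDOUT so the resulting bytes can be inspected. The variable names and the sample string are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $line = "T\x{da}G\n";    # "TÚG" plus newline, as a Perl character string

# Option 1: encode explicitly and print to a handle without an encoding layer.
my $explicit = '';
open(my $raw_fh, '>', \$explicit) or die "cannot open in-memory handle: $!";
print {$raw_fh} encode('UTF-8', $line);
close($raw_fh);

# Option 2: let an :encoding layer on the handle do the conversion.
my $layered = '';
open(my $enc_fh, '>:encoding(UTF-8)', \$layered) or die "cannot open in-memory handle: $!";
print {$enc_fh} $line;
close($enc_fh);

# Doing both encodes twice: the already encoded bytes are treated as
# characters and encoded again.
my $doubled = '';
open(my $both_fh, '>:encoding(UTF-8)', \$doubled) or die "cannot open in-memory handle: $!";
print {$both_fh} encode('UTF-8', $line);
close($both_fh);

printf "explicit: %d bytes, layered: %d bytes, doubled: %d bytes\n",
    length($explicit), length($layered), length($doubled);
print "options 1 and 2 agree: ", ($explicit eq $layered ? "yes" : "no"), "\n";
```

Since the original script already calls binmode on STDOUT, the second option only requires dropping the decode() call, while the first would also mean removing that binmode line.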