http://qs1969.pair.com?node_id=714637

MattLG has asked for the wisdom of the Perl Monks concerning the following question:

I have a string that I'm getting from a UTF-8 database that contains a UTF-8 character in amongst the generic ones. The string is "Mali Lošinj". I'm trying to turn the s with caron into a normal s. My main script is doing a similar thing with other European characters (removing accents and other diacritics from them), but those other characters are all also present in ISO-8859-1. It's just the š character that I can't get to work, and I assume I'll have the same problem with any other UTF-8-only characters I come across in the future.

I have reduced the problem down to the following code:

#!/usr/bin/perl -C
#use utf8;
use DBI;

$dbh = DBI->connect('DBI:mysql:database=*****;host=localhost;port=3306','*****','*****');
$dbh->do('SET NAMES utf8');
$sth = $dbh->prepare("select * from towns where town like \"Mali Lo%\"");
$sth->execute;
$p = $sth->fetchrow_hashref;
$town = $p->{town};

print "Content-type: text/html\n\n";
print $town;

$town1 = $town2 = $town3 = $town4 = $town5 = $town6 = $town;

$town1 =~ tr/Š/s/;
print "1($town1)";

$town2 =~ tr/š/s/;
print "2($town2)";

$town3 =~ tr/Šš/ss/;
print "3($town3)";

$town4 =~ s/š/s/g;
print "4($town4)";

$town5 =~ s/Š/s/g;
print "5($town5)";

$town6 =~ s/[Šš]/s/g;
print "6($town6)";

I assumed I needed "use utf8" as the Š characters are "in the code", but if I uncomment "use utf8", none of the translations or substitutions have any effect; I just print out the original text each time.

When "use utf8" is commented out, I get the following:

Mali Lošinj1(Mali Los�inj)2(Mali Lossinj)3(Mali Lossinj)4(Mali Lošinj)5(Mali Losinj)6(Mali Lossinj)

The first one, converts it to an "s" followed by an unidentified character (However, this should do nothing because it is the wrong case).
The second, third and sixth replace it with 2 "s" characters.
The fourth one does nothing, which is correct because it's matching the wrong case.
The fifth one seems to do what I want for this character, but I'd rather not have a different s/// line for EVERY utf8 character that I'm trying to convert.

So, what am I not understanding here? And what would you suggest as the most appropriate course of action? I have a single tr/// line altering 51 other ISO-8859-1/UTF-8 characters without any problem.

Cheers.

MattLG

Re: utf8 characters in tr/// or s///
by moritz (Cardinal) on Sep 30, 2008 at 21:00 UTC
    The only sane approach is to use utf8;, and to decode the strings that DBI returns with Encode::decode_utf8, unless your DBD::mysql does that for you already.

    I have no experience with tr and Unicode, but s/// works fine.

    When you want to print out stuff, you also need binmode STDOUT, ':encoding(UTF-8)'; or similar stuff.
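A minimal sketch of that approach, assuming the database hands back undecoded UTF-8 bytes. The `$bytes` string below is a hard-coded stand-in for the DBI result (not a real query), and the source file itself is assumed to be saved as UTF-8:

```perl
use strict;
use warnings;
use utf8;                               # literals such as š in this file are characters
use Encode qw(decode_utf8);

# Stand-in for an undecoded value from DBI: the UTF-8 bytes of "Mali Lošinj"
# (š is the two bytes 0xC5 0xA1).
my $bytes = "Mali Lo\xC5\xA1inj";

my $town = decode_utf8($bytes);         # character string, 11 characters long

(my $plain = $town) =~ s/[Šš]/s/g;      # one character in, one character out

binmode STDOUT, ':encoding(UTF-8)';     # encode characters again on the way out
print "$town -> $plain\n";
```

With both the source literals and the data decoded, the pattern and the string agree on what a "character" is, which is why the substitution matches exactly once.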

    See also encodings, Unicode and Perl, perluniintro and perlunifaq.

    Update: And take a look at Text::Unidecode, it might save you quite a bit of work.

Re: utf8 characters in tr/// or s///
by graff (Chancellor) on Oct 01, 2008 at 03:32 UTC
    When you fetch utf8 text from mysql, you should always run it through Encode::decode("utf8",...) -- update: or equivalent, as shown by ikegami -- so that perl has a valid utf8 string with the "utf8" flag turned on. Then, you can do lots of useful things using normal perl string operations.

    For example, here's a neat and easy way to eliminate all diacritic marks that come attached to ascii Latin alphabetic letters:

    use Encode qw/decode is_utf8/;
    use Unicode::Normalize;

    # let $string be value that was just fetched from a utf8 database field,
    # in which case, you will most likely need to do this:
    $string = decode( "utf8", $string );

    # or just for testing, comment out the previous line, and
    # $string = join( "", map{chr()} 0xc0..0xff ); # uncomment this line

    # NFD normalization splits off all diacritic marks as separate code points
    # and these "combining" marks for latin are in the U0300-U036F range
    ( $string_nd = NFD( $string )) =~ tr/\x{300}-\x{36f}//d;

    binmode STDOUT, ":utf8"; # just to be sure this has been done
    print "original: << $string >>\n";
    print " edited: << $string_nd >>\n";

    Alas, that form of normalization does not convert "ø" to "o", or "Æ" to "AE", or "ß" to "ss", etc. That is, there may still be non-ascii characters in the final result, depending on what you have in your database, and for stuff like that, you'll just have to face the task of defining what sort of behavior you really want (e.g. just strip them out, or define an explicit list of replacements, or...)
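One way to handle the leftovers is an explicit replacement list applied after the NFD step. This is only a sketch: the `%extra` table and the `to_ascii_ish` helper are hypothetical names made up for illustration, the table covers just a handful of letters, and the source file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                                # the literals in %extra are characters
use Unicode::Normalize qw(NFD);

# Hypothetical replacement list for letters that NFD cannot split into
# "base letter + combining mark". Extend it to match your own data.
my %extra = (
    'ø' => 'o',  'Ø' => 'O',
    'æ' => 'ae', 'Æ' => 'AE',
    'ß' => 'ss',
);
my $extra_class = join '', keys %extra;

sub to_ascii_ish {
    my ($text) = @_;                     # expects an already decoded string
    my $out = NFD($text);
    $out =~ tr/\x{300}-\x{36f}//d;       # drop the combining marks
    $out =~ s/([$extra_class])/$extra{$1}/g;
    return $out;
}

binmode STDOUT, ':encoding(UTF-8)';
print to_ascii_ish($_), "\n" for 'Mali Lošinj', 'Straße', 'Ærø';
```

Anything not covered by either step passes through unchanged, so a character inventory of the data is still the way to find out what the table needs.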

    In case it might help, it's easy to get an inventory of the characters you have in the database, so that you can see which ones, if any, need special attention beyond just stripping diacritic marks. I posted a little tool here that shows one way to do that: unichist -- count/summarize characters in data.

    One other caveat about that normalization process: for a number of languages (e.g. those that use Arabic, Hebrew, Devanagari, or other non-Latin scripts with diacritic marks), you may want/need to apply "NFC" normalization (also provided by Unicode::Normalize) after doing "NFD" and Latin diacritic removal, so that you "recompose" the non-Latin characters and diacritics into their "canonical" combined-character forms.
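A small sketch of that NFD, strip, NFC sequence. The sample string is an assumption chosen for illustration: "café" followed by the Arabic letter alef with madda above (U+0622), whose combining mark (U+0653) lies outside the U+0300-U+036F range.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Written with escapes so the example does not depend on the file's encoding.
my $string = "caf\x{e9} \x{622}";

my $decomposed = NFD($string);          # é -> e + U+0301, U+0622 -> U+0627 + U+0653
$decomposed =~ tr/\x{300}-\x{36f}//d;   # removes U+0301 only
my $recomposed = NFC($decomposed);      # U+0627 + U+0653 -> U+0622 again

printf "decomposed: %d characters, recomposed: %d characters\n",
    length $decomposed, length $recomposed;
```

The Latin accent is gone in both results, but only the recomposed string has the Arabic letter back in its single-code-point form.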

    (update; having just seen ikegami's point about the "utf8::" functions, I agree -- that's a fine alternative to "use Encode".)

      When you fetch utf8 text from mysql, you should always run it through Encode::decode("utf8",...) -- update: or equivalent, as shown by ikegami

      When using a fairly recent version of DBD::mysql, you can use the mysql_enable_utf8 option. Or, to quote:

      This attribute determines whether DBD::mysql should assume strings stored in the database are utf8. This feature defaults to off.

      When set, a data retrieved from a textual column type (char, varchar, etc) will have the UTF-8 flag turned on if necessary. This enables character semantics on that string. You will also need to ensure that your database / table / column is configured to use UTF8. See Chapter 10 of the mysql manual for details.

      Additionally, turning on this flag tells MySQL that incoming data should be treated as UTF-8. This will only take effect if used as part of the call to connect(). If you turn the flag on after connecting, you will need to issue the command SET NAMES utf8 to get the same effect.

      This option is experimental and may change in future versions.

      And yes, this is experimental, but it seemed fairly stable in my tests.

      --
      b10m
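For reference, a sketch of passing that option at connect time, as the quoted documentation requires. The `connect_utf8` helper and the `%attr` hash are hypothetical names, the DSN and credentials are placeholders, and DBI is only loaded when the helper is actually called, so nothing here talks to a database on its own.

```perl
use strict;
use warnings;

# Connection attributes. mysql_enable_utf8 is the DBD::mysql option quoted above;
# RaiseError is a general DBI attribute added here for convenience.
my %attr = (
    RaiseError        => 1,
    mysql_enable_utf8 => 1,
);

# Hypothetical helper: pass the attributes as part of the connect() call
# rather than setting them on the handle afterwards.
sub connect_utf8 {
    my ($dsn, $user, $pass) = @_;
    require DBI;
    return DBI->connect($dsn, $user, $pass, \%attr);
}

# Usage, with placeholder values:
# my $dbh = connect_utf8('DBI:mysql:database=*****;host=localhost', '*****', '*****');
```

Whether the fetched strings still need an explicit decode afterwards depends on the DBD::mysql version, so it is worth checking `length` on a known value such as "Mali Lošinj" before removing any existing decode calls.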
        This is good to know -- thanks++!

        Based on the description, it sounds like it may be a while before this sort of facility becomes "normal", to the extent that folks would find transitioning to it to be easier than staying with the older approach.

        The situation reminds me of a Larry Wall quote (in the perlunicode mail list, wouldn't you know) -- this was four years ago, but it still resonates:

        Perl's always been about providing reasonable defaults, and will continue to do so. But changing what's reasonable is tricky, and sometimes you have to go through a period in which nothing can be considered reasonable.

      Brilliant! You guys RULE!

      Thanks.

      MattLG

        And one other thing that I'm finding conflicting advice for on the internet is packing the incoming data from CGI into utf8.

        I currently use:

        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

        against the strings that come in via the web.

        Now I see that there's a "U" template for unicode. But I'm after UTF8, so that doesn't quite fit, and I don't understand what the pack docs are saying about UTF-8. However, in a couple of places I've searched I've found this:

        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        utf8::decode($value);

        which I don't really understand. I'd assumed the "C" would put everything into ASCII/ISO-8859-1 and utf8::decoding that would just produce garbage out of the special characters.

        What would the monks advise?

        Cheers

        MattLG
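On the percent-decoding question above, a minimal sketch of why the two-step version holds together: `pack("C", ...)` only rebuilds the raw bytes that the browser sent, and `utf8::decode` then interprets that byte sequence as UTF-8. The sample query value is an assumption made up for illustration.

```perl
use strict;
use warnings;

# Sample form value as it might arrive from a UTF-8 page.
my $value = 'Mali+Lo%C5%A1inj';

$value =~ tr/+/ /;                                            # form encoding: + means space
$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;  # %XX -> one byte each

my $byte_length = length $value;      # š is still two separate bytes here

utf8::decode($value)                  # bytes -> characters, in place
    or die "input was not valid UTF-8";

my $char_length = length $value;      # š is now a single character

print "bytes: $byte_length, characters: $char_length\n";
```

`utf8::decode` returns false when the bytes are not valid UTF-8, so the check gives a place to reject or fall back on input from clients that sent some other encoding.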

Re: utf8 characters in tr/// or s///
by ikegami (Patriarch) on Oct 01, 2008 at 02:16 UTC

    I assumed I needed "use utf8" as the Š characters are "in the code"

    If your source file is UTF-8 encoded, you do. It decodes it.

    but if I uncomment "use utf8", none of the translations or substitutions have any effect; I just print out the original text each time.

    I agree with moritz. That means the database is returning the strings encoded. You need to decode them.

    my $p = $sth->fetchrow_hashref;
    utf8::decode( $p->{town} );
    my $town = $p->{town};

    (The "utf8::" functions are always present. No need to "use" anything.)
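The difference can be seen without a database. This sketch assumes the undecoded value holds the UTF-8 bytes of "Mali Lošinj" and writes the source literals as escapes: without "use utf8", Š is the two bytes 0xC5 0xA0 and š is 0xC5 0xA1, so tr/// sees two separate "characters" in each. The `show` helper is a hypothetical name used only to make non-ASCII values visible.

```perl
use strict;
use warnings;

# Print non-ASCII characters as \x{..} so the result is unambiguous.
sub show {
    return join '', map { ord($_) < 0x80 ? $_ : sprintf('\\x{%X}', ord $_) }
                    split //, $_[0];
}

my $encoded = "Mali Lo\xC5\xA1inj";             # undecoded bytes, as fetched

# What tr/Š/s/ means on bytes: 0xC5 -> s and 0xA0 -> s; the 0xA1 byte is left behind.
(my $t1 = $encoded) =~ tr/\xC5\xA0/s/;

# What tr/š/s/ means on bytes: 0xC5 -> s and 0xA1 -> s, hence the double "s".
(my $t2 = $encoded) =~ tr/\xC5\xA1/s/;

# After decoding, š is one character (U+0161) and maps to one "s".
my $decoded = $encoded;
utf8::decode($decoded);
(my $ok = $decoded) =~ tr/\x{160}\x{161}/Ss/;

print show($_), "\n" for $t1, $t2, $ok;
# prints:
#   Mali Los\x{A1}inj
#   Mali Lossinj
#   Mali Losinj
```

The first two lines reproduce results 1 and 2 from the original post; the third is what a single tr/// line gives once both the code and the data are characters.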

Re: utf8 characters in tr/// or s///
by MattLG (Sexton) on Sep 30, 2008 at 20:45 UTC

    Sorry, it's perl 5.8.8 on linux if that matters.

    And the -C switch doesn't seem to have any effect.