http://qs1969.pair.com?node_id=714637

MattLG has asked for the wisdom of the Perl Monks concerning the following question:

I have a string that I'm getting from a UTF-8 database that contains a UTF-8 character in amongst the plain ASCII ones. The string is "Mali Lošinj". I'm trying to turn the S with caron into a normal S. My main script is doing a similar thing with other European characters (removing accents and other diacritics from them), but the other characters are all also present in ISO-8859-1. It's just the š character that I can't get to work. And I assume I'll have the same problem with any other UTF-8-only characters I come across in the future.

I have reduced the problem down to the following code:

#!/usr/bin/perl -C
#use utf8;
use DBI;

$dbh = DBI->connect('DBI:mysql:database=*****;host=localhost;port=3306','*****','*****');
$dbh->do('SET NAMES utf8');
$sth = $dbh->prepare("select * from towns where town like \"Mali Lo%\"");
$sth->execute;
$p = $sth->fetchrow_hashref;
$town = $p->{town};

print "Content-type: text/html\n\n";
print $town;

$town1 = $town2 = $town3 = $town4 = $town5 = $town6 = $town;

$town1 =~ tr/Š/s/;     print "1($town1)";
$town2 =~ tr/š/s/;     print "2($town2)";
$town3 =~ tr/Šš/ss/;   print "3($town3)";
$town4 =~ s/š/s/g;     print "4($town4)";
$town5 =~ s/Š/s/g;     print "5($town5)";
$town6 =~ s/[Šš]/s/g;  print "6($town6)";

I assumed I needed "use utf8" as the Š characters are "in the code", but if I uncomment "use utf8", none of the transliterations or substitutions have any effect; I just print out the original text each time.

When "use utf8" is commented out, I get the following:

Mali Lošinj1(Mali Los�inj)2(Mali Lossinj)3(Mali Lossinj)4(Mali Lošinj)5(Mali Losinj)6(Mali Lossinj)

The first one converts it to an "s" followed by an unidentified character (however, this should do nothing because it is the wrong case).
The second, third and sixth replace it with 2 "s" characters.
The fourth one does nothing, which is correct because it's matching the wrong case.
The fifth one seems to do what I want for this character, but I'd rather not have a different s/// line for EVERY utf8 character that I'm trying to convert.
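
For what it's worth, the doubled "s" can be reproduced without the database. The following is only a minimal sketch, under the assumption (not verified against my real setup) that the string arrives as undecoded UTF-8 bytes, so that š (U+0161) is really the two bytes C5 A1. The variable names are made up for the illustration. It compares a byte-wise tr/// with a tr/// applied after decoding via the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the driver hands back raw UTF-8 bytes, not decoded characters.
# "š" (U+0161) is encoded as the two bytes C5 A1.
my $bytes = "Mali Lo\xC5\xA1inj";

# Without "use utf8", a literal š in tr/š/s/ is seen as two separate bytes,
# i.e. tr/\xC5\xA1/s/. Each byte is mapped to "s", hence the double "s".
(my $bytewise = $bytes) =~ tr/\xC5\xA1/s/;

# Decode first so that š is a single character, then transliterate
# Š (U+0160) and š (U+0161) in one tr///.
my $chars = decode('UTF-8', $bytes);
(my $charwise = $chars) =~ tr/\x{160}\x{161}/Ss/;

print "$bytewise\n";   # Mali Lossinj
print "$charwise\n";   # Mali Losinj
```

If that assumption holds, it would also be consistent with "use utf8" matching nothing: the pattern would then contain one character while the data still contains two bytes.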

So, what am I not understanding here? And what would you suggest as the most appropriate course of action? I have a single tr/// line altering 51 other ISO-8859-1/UTF-8 characters without any problem.

Cheers.

MattLG