MattLG has asked for the wisdom of the Perl Monks concerning the following question:
I have a string that I'm getting from a UTF-8 database that contains a UTF-8 character in amongst the generic ones. The string is "Mali Lošinj". I'm trying to turn the S with caron into a normal S. My main script is doing a similar thing with other European characters (removing accents and other diacritics from them), but the other characters are all also present in ISO-8859-1. It's just the š character that I can't get to work, and I assume I'll have the same problem with any other UTF-8-only characters I come across in the future.
I have reduced the problem down to the following code:
    #!/usr/bin/perl -C
    #use utf8;
    use DBI;
    $dbh = DBI->connect('DBI:mysql:database=*****;host=localhost;port=3306','*****','*****');
    $dbh->do('SET NAMES utf8');
    $sth = $dbh->prepare("select * from towns where town like \"Mali Lo%\"");
    $sth->execute;
    $p = $sth->fetchrow_hashref;
    $town = $p->{town};
    print "Content-type: text/html\n\n";
    print $town;
    $town1 = $town2 = $town3 = $town4 = $town5 = $town6 = $town;
    $town1 =~ tr/Š/s/;
    print "1($town1)";
    $town2 =~ tr/š/s/;
    print "2($town2)";
    $town3 =~ tr/Šš/ss/;
    print "3($town3)";
    $town4 =~ s/š/s/g;
    print "4($town4)";
    $town5 =~ s/Š/s/g;
    print "5($town5)";
    $town6 =~ s/[Šš]/s/g;
    print "6($town6)";
I assumed I needed "use utf8" as the Š characters are "in the code", but if I uncomment "use utf8", none of the translations or substitutions have any effect; I just print out the original text each time.
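One plausible reading of this symptom: with "use utf8" the literal š in the source becomes a single character (U+0161), while the value coming back from the database is still a string of undecoded UTF-8 bytes, so the two never match. The sketch below illustrates that idea only. It assumes the driver hands back raw bytes (the hard-coded `$raw` stands in for the fetched column) and that the script file itself is saved as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                          # literals in this file are UTF-8 characters
use Encode qw(decode encode);

# Stand-in for the value fetched from the database.
# Assumption: the driver returned raw, undecoded UTF-8 bytes.
my $raw = "Mali Lo\xC5\xA1inj";

# Decode the bytes into a character string, so that "š" is ONE character.
my $town = decode('UTF-8', $raw);

# tr/// now works on characters on both sides of the comparison.
(my $plain = $town) =~ tr/Šš/Ss/;

# Encode again on the way out (a no-op here, since the result is ASCII).
print encode('UTF-8', $plain), "\n";
```

If the assumption about raw bytes holds, DBD::mysql's `mysql_enable_utf8` connect attribute may be an alternative to calling `decode` by hand, but that is worth checking against the driver version in use.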
When "use utf8" is commented out, I get the following:
Mali Lošinj1(Mali Los�inj)2(Mali Lossinj)3(Mali Lossinj)4(Mali Lošinj)5(Mali Losinj)6(Mali Lossinj)
The first one converts it to an "s" followed by an unidentified character. (However, this should do nothing, because it is the wrong case.)
The second, third and sixth replace it with 2 "s" characters.
The fourth one does nothing, which is correct because it's matching the wrong case.
The fifth one seems to do what I want for this character, but I'd rather not have a different s/// line for EVERY utf8 character that I'm trying to convert.
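The doubled "s" and the stray unidentified character are consistent with tr/// working on bytes rather than characters. Without "use utf8" and without decoding, each caron letter is two bytes, and both letters share the same lead byte. A minimal sketch of that behaviour, using explicit byte escapes instead of literal characters so it does not depend on how the file is saved (the variable names are illustrative only):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Raw UTF-8 bytes, as a byte-oriented script sees them.
my $lower = "\xC5\xA1";            # UTF-8 encoding of U+0161 (lowercase s with caron)
my $upper = "\xC5\xA0";            # UTF-8 encoding of U+0160 (uppercase S with caron)
my $town  = "Mali Lo${lower}inj";

printf "length in bytes: %d\n", length $town;   # 12 bytes for 11 visible characters

# Search list = the two bytes of the UPPERCASE letter. The replacement list
# is shorter, so its last character is reused: C5 -> s and A0 -> s.
# In $town only the shared lead byte C5 matches; the trailing A1 is orphaned.
(my $t1 = $town) =~ tr/\xC5\xA0/s/;

# Search list = the two bytes of the LOWERCASE letter: C5 -> s and A1 -> s,
# so both bytes of the letter become "s".
(my $t2 = $town) =~ tr/\xC5\xA1/s/;

# Hex dumps make the leftover byte visible.
print join(' ', map { sprintf '%02X', ord } split //, $t1), "\n";
print "$t2\n";
```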
So, what am I not understanding here? And what would you suggest as the most appropriate course of action? I have a single tr/// line altering 51 other ISO-8859-1/UTF-8 characters without any problem.
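As one possible way to avoid a separate line per character, the string can be decomposed with the core Unicode::Normalize module and the combining marks stripped afterwards. This is only a sketch under assumptions: `strip_accents` is a hypothetical helper name, the input is assumed to be raw UTF-8 bytes that get decoded first, and letters with no canonical decomposition (for example ø, ß, æ) are not touched and would still need an explicit tr/// or s/// mapping.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: expects a decoded character string, not raw bytes.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);       # "š" becomes "s" + U+030C COMBINING CARON
    $decomposed =~ s/\p{Mn}+//g;       # drop all nonspacing combining marks
    return $decomposed;
}

# Stand-in for the database value (assumed to arrive as raw UTF-8 bytes).
my $town = decode('UTF-8', "Mali Lo\xC5\xA1inj");

print strip_accents($town), "\n";
```

The same helper handles the accented Latin-1 characters as well, since they decompose in the same way, so it could sit alongside a short tr/// for the few letters that do not decompose.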
Cheers.
MattLG
Replies are listed 'Best First'.
Re: utf8 characters in tr/// or s///
by moritz (Cardinal) on Sep 30, 2008 at 21:00 UTC
Re: utf8 characters in tr/// or s///
by graff (Chancellor) on Oct 01, 2008 at 03:32 UTC
by b10m (Vicar) on Oct 05, 2008 at 20:03 UTC
by graff (Chancellor) on Oct 06, 2008 at 02:31 UTC
by MattLG (Sexton) on Oct 01, 2008 at 20:35 UTC
by MattLG (Sexton) on Oct 04, 2008 at 17:28 UTC
by graff (Chancellor) on Oct 06, 2008 at 02:20 UTC
Re: utf8 characters in tr/// or s///
by ikegami (Patriarch) on Oct 01, 2008 at 02:16 UTC
Re: utf8 characters in tr/// or s///
by MattLG (Sexton) on Sep 30, 2008 at 20:45 UTC