I have a string, fetched from a UTF-8 database, that contains a UTF-8 character in amongst the plain ASCII ones. The string is "Mali Lošinj". I'm trying to turn the s with caron (š) into a normal s. My main script does a similar thing with other European characters (stripping the accents and other diacritics from them), but those other characters are all also present in ISO-8859-1. It's just the š character that I can't get to work, and I assume I'll have the same problem with any other UTF-8-only characters I come across in the future.

I have reduced the problem down to the following code:

#!/usr/bin/perl -C

#use utf8;
use DBI;
$dbh = DBI->connect('DBI:mysql:database=*****;host=localhost;port=3306','*****','*****');
$dbh->do('SET NAMES utf8');
$sth = $dbh->prepare("select * from towns where town like \"Mali Lo%\"");
$sth->execute;
$p = $sth->fetchrow_hashref;
$town = $p->{town};

print "Content-type: text/html\n\n";
print $town;

$town1 = $town2 = $town3 = $town4 = $town5 = $town6 = $town;

$town1 =~ tr/Š/s/;
print "1($town1)";

$town2 =~ tr/š/s/;
print "2($town2)";

$town3 =~ tr/Šš/ss/;
print "3($town3)";

$town4 =~ s/Š/s/g;
print "4($town4)";

$town5 =~ s/š/s/g;
print "5($town5)";

$town6 =~ s/[Šš]/s/g;
print "6($town6)";

I assumed I needed "use utf8" because the Š characters are "in the code", but if I uncomment "use utf8", none of the translations or substitutions has any effect; I just print out the original text each time.

When "use utf8" is commented out, I get the following:

Mali Lošinj1(Mali Los�inj)2(Mali Lossinj)3(Mali Lossinj)4(Mali Lošinj)5(Mali Losinj)6(Mali Lossinj)

The first one converts it to an "s" followed by an unidentified character (however, this should do nothing because it is the wrong case).
The second, third and sixth replace it with two "s" characters.
The fourth one does nothing, which is correct because it's matching the wrong case.
The fifth one seems to do what I want for this character, but I'd rather not have a different s/// line for EVERY utf8 character that I'm trying to convert.
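
To take the database out of the picture, here is a minimal sketch that appears to reproduce the same results. It assumes the fetched value is the raw, undecoded UTF-8 byte string (so š arrives as the two bytes C5 A1), which I have not confirmed for my DBD::mysql setup. The variable names and the hex_bytes helper are made up for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# ASSUMPTION: the database hands back undecoded UTF-8 bytes, so the
# "š" (U+0161) in the town name is really the two bytes 0xC5 0xA1.
my $town = "Mali Lo\xC5\xA1inj";

# Hypothetical helper, for display only: dump a string as hex bytes.
sub hex_bytes {
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $_[0]);
}

# Without "use utf8", a literal Š in the source is the bytes C5 A0 and
# š is C5 A1, so tr/// treats each byte as its own character and pads
# the replacement list by repeating its last character.
(my $tr_upper = $town) =~ tr/\xC5\xA0/s/;    # like tr/Š/s/: C5 -> s, A1 left behind
(my $tr_lower = $town) =~ tr/\xC5\xA1/s/;    # like tr/š/s/: C5 -> s, A1 -> s

# s/// matches the two-byte sequence as a unit, so it yields a single "s".
(my $subst_lower = $town) =~ s/\xC5\xA1/s/g;

# Decoding to characters first lets one tr/// handle both cases.
my $chars = decode('UTF-8', $town);
(my $fixed = $chars) =~ tr/\x{160}\x{161}/ss/;

print "tr upper: ", hex_bytes($tr_upper), "\n";
print "tr lower: ", hex_bytes($tr_lower), "\n";
print "s///:     ", hex_bytes($subst_lower), "\n";
print "decoded:  ", encode('UTF-8', $fixed), "\n";
```

If that assumption is right, it might also explain the "use utf8" case: the literals in the source become characters while the data from the database is still bytes, so nothing matches.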

So, what am I not understanding here? And what would you suggest as the most appropriate course of action? I have a single tr/// line altering 51 other ISO-8859-1/UTF-8 characters without any problem.

Cheers.

MattLG

In reply to utf8 characters in tr/// or s/// by MattLG
