There are monks who have to coax UTF-8 out of older MySQL servers under older versions of Perl in their daily labours. In such cases the builtins pack() and unpack() can do the job, on Perl 5.6 and later.
I offer the following humble script, in the hope that parts of it might aid others in their assigned tasks.
#!/usr/bin/perl
# Demonstrate some real-world encoding fixes.
# This has been tested on Perls 5.6.1 and 5.8.0.
BEGIN {
require v5.6.0;
}
use utf8;
no bytes;
binmode STDOUT, ':utf8' if $] >= 5.008; # suppresses the "Wide character in print" warning
# Correctly encoded data (string-is-unicode bit set)
print "\n\$good:\n";
$good = chr(0x03B1) . chr(0x00DF) . chr(0x044D);
print $good, "\n",
length($good), "\n"; # three characters
# Make a copy without the string-is-unicode bit set on it
# This is the kind of thing DBD::mysql returns if you put something like $good
# into the database originally.
binmode STDOUT, ':bytes' if $] >= 5.008;
print "\n\$bad:\n";
$bad = pack("C0C*", unpack("C0C*", $good));
print $bad, "\n",
length($bad), "\n"; # six bytes
print "\n(\$bad eq \$good): " . (($bad eq $good) ? "yes" : "no") . "\n";
# At Perl 5.6.1, this says "yes".
# At Perl 5.8.0, this says "no".
# Repack the bad string into another correctly-tagged string
binmode STDOUT, ':utf8' if $] >= 5.008;
print "\n\$also_good:\n";
$also_good = pack("U0U*", unpack("U0U*", $bad));
print $also_good, "\n",
length($also_good), "\n";
print "\n(\$bad eq \$also_good): "
. (($bad eq $also_good) ? "yes" : "no") . "\n";
print "(\$good eq \$also_good): "
. (($good eq $also_good) ? "yes" : "no") . "\n\n";
There's a meditation in here on "U0U*" vs. "U*", I think.
Note: I use a UTF-8-capable terminal, hence the fiddling with binmode. That's another can of worms though.
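For monks who are only stuck with the old database and not the old Perl, here is a minimal sketch of the same round trip using the Encode module, which ships with Perl 5.8 and later but not with a stock 5.6.1. This swaps in a different technique from the pack()/unpack() approach above; it is not a drop-in for 5.6. The encode() call merely simulates an untagged fetch, on the assumption that the driver hands back raw UTF-8 octets as described in the script's comments.

```perl
#!/usr/bin/perl
# Sketch only: Encode-based re-tagging, assuming Perl 5.8 or later.
use strict;
use warnings;
use Encode qw(encode decode);

# The same three characters as $good in the script above.
my $good = chr(0x03B1) . chr(0x00DF) . chr(0x044D);

# Simulate the untagged data: raw UTF-8 octets, one "character" per byte.
# (Assumption: this is what the database driver returns.)
my $bad = encode('UTF-8', $good);

# Re-tag by decoding the octets as UTF-8.
my $also_good = decode('UTF-8', $bad);

binmode STDOUT, ':utf8';
print "length(\$bad):       ", length($bad), "\n";        # 6 (bytes)
print "length(\$also_good): ", length($also_good), "\n";  # 3 (characters)
print "(\$good eq \$also_good): ",
    (($good eq $also_good) ? "yes" : "no"), "\n";         # yes
```

Unlike the pack() templates, decode() says outright which encoding the bytes are assumed to be in, which may make the intent easier to read later.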