The unsaid base problem here is that someone started with the string "ต้มยำกุ้ง" but then incorrectly decoded it (probably to Windows cp1258) to create a new string of "ต้มยำกุ้ง"
Note that in UTF-8, each of the nine charcters in the Thai string takes three bytes. Therefore the Latin 1 decoding includes 9 x 3 = 27 characters and each triplet begins with . It is often the case that an incorrectly decoded UTF-8 string into a Latin 1 character set will show each original character as beginning with some accented form of the letter a or A.
I was not able to repair the string in place, but writing it to a file, I can use Perl's IO Layers and the Encode module to repair the encoding.
use strict; use Encode; $|++; my $t = 'thai.txt'; # contains => ต้มยำกุ้ง open my $fh, '<:raw', $t or die "Couldn't open $t: $!"; my $content = do { local $/; <$fh> }; close $fh; $content = decode('UTF-8', $content); binmode *STDOUT, ':encoding(UTF-8)'; print "$content\n";
However, note that rather than writing this program you can use Perl's wonderful character encoder/decoder without writing any code:
piconv -t UTF-8 thai.txt > thai_fixed.txt
In reply to Re^3: How to convert grabled characters into their real value
by rcrews
in thread How to convert grabled characters into their real value
by Anonymous Monk
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |