Hey everyone!
I have a script that queries a game server for player names by UDP, parses the packet, and puts info about the server on a web page. The player names can and often do contain Unicode characters, so I wrote a decode function that also converts the Unicode into HTML numeric entities, as some browsers seem buggy with raw Unicode chars. The player names are displayed in "Tahoma" font on the web page, as this closely matches the in-game font. This works pretty well, but there is one guy whose name simply won't get decoded by my routine.
I extracted his name in byte representation:
$pname = "\x78\x54\xC5\x99\xD8\xB9\xD0\xBC\xD8\xB9\x2E\x20\xE1\xB8\xA0\x7C\x20\x7B\xC4\xA2\x6C\xC3\xA2\xC3\x90\xC3\xAE\xC3\xA2\x54\xC5";
1) Name in plain unicode:
xTřعмع. Ḡ| {ĢlâÐîâT
2) And now the name as it appears without unicode decoding:
xTřعмع. Ḡ| {Ä¢lâÃîâTÅ
As you can see, the last byte (\xC5, which shows up as Å, an A with ring above) is not displayed at all in the Unicode version, which I guess is correct. But its presence somehow disrupts my routine.
And this is my decode routine:
sub unicode_decode
{
    my $string = shift;
    utf8::decode($string);
    $string =~ s/([^a-zA-Z0-9])/'&#'.unpack('U0U*',$1).';'/eg;
    return($string);
}
Whenever I remove the last byte \xC5 from the name above, the routine transforms the name perfectly (it looks like (1)). But if I don't remove it, the Unicode decoding has no effect and the result looks like (2). The regex still converts the name into HTML codes, but byte by byte, not as Unicode characters.
My guess is that a trailing \xC5 is not valid UTF-8, either standing alone or after the byte that precedes it. But I can't control what names people pick, so what can I do about it?
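A quick way to test that guess: as far as I can tell from the utf8 documentation, utf8::decode() returns false and leaves the string untouched when the bytes are not well-formed UTF-8, which would explain why the regex then works on raw bytes. Below is a minimal sketch using a shortened two-character sample instead of the full name; try_decode is a made-up helper name used only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a copy of the bytes and report whether
# utf8::decode() accepted them as well-formed UTF-8.
sub try_decode {
    my $bytes = shift;
    my $ok = utf8::decode($bytes);   # true on success, false on malformed input
    return ($ok ? 1 : 0, $bytes);
}

my ($ok_good, $good) = try_decode("\x54\xC5\x99");  # "T" plus a complete 2-byte sequence
my ($ok_bad,  $bad)  = try_decode("\x54\xC5");      # "T" plus a lone lead byte \xC5

print "complete:  ok=$ok_good length=", length($good), "\n";
print "truncated: ok=$ok_bad length=", length($bad), "\n";
```

Checking the return value this way at least makes the failure visible, instead of silently passing undecoded bytes on to the regex.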
Edit:
I now tried the Encode module and its decode() routine, which gives better results, and does not break my regex:
xTřعмع. Ḡ| {ĢlâÐîâT�
The last character has decimal value 65533 (U+FFFD, the Unicode replacement character). I suppose I can live with a box behind the names of people who are unable to create correct Unicode names :-)
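For reference, here is a rough sketch of what the Encode-based variant could look like. It assumes decode() is called without a CHECK argument, in which case malformed bytes are replaced with U+FFFD rather than aborting the decode. The name unicode_decode_lenient is made up for this example, and ord($1) is swapped in for the unpack('U0U*', ...) call of the original routine; the sample input is again a shortened byte string, not the full name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode raw UTF-8 bytes, then turn every character outside [a-zA-Z0-9]
# into a numeric HTML entity. With no CHECK argument, Encode substitutes
# U+FFFD for malformed bytes instead of giving up on the whole string.
sub unicode_decode_lenient {
    my $bytes = shift;
    my $chars = decode('UTF-8', $bytes);
    $chars =~ s/([^a-zA-Z0-9])/'&#' . ord($1) . ';'/eg;
    return $chars;
}

# "T", a complete 2-byte sequence, "T", then a lone lead byte \xC5.
print unicode_decode_lenient("\x54\xC5\x99\x54\xC5"), "\n";
```

The stray byte ends up as &#65533; at the end of the output, which matches the box seen behind the name above.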
Thanks,
Forlix