Hey everyone!

I have a script that queries a game server for player names over UDP, parses the packet, and puts info about the server on a web page. The player names can, and often do, contain Unicode characters, so I wrote a decode function that also converts the Unicode into HTML character codes (&#...;), as some browsers seem buggy with raw Unicode chars. The player names are displayed in the "Tahoma" font on the web page, as this closely matches the in-game font. This works pretty well, but there is one guy whose name simply won't get decoded by my regex.
I extracted his name in byte representation:

$pname = "\x78\x54\xC5\x99\xD8\xB9\xD0\xBC\xD8\xB9\x2E\x20\xE1\xB8\xA0\x7C\x20\x7B\xC4\xA2\x6C\xC3\xA2\xC3\x90\xC3\xAE\xC3\xA2\x54\xC5";

1) Name in plain Unicode:
xTřعмع. Ḡ| {ĢlâÐîâT

2) And now the name as it appears without Unicode decoding:
xTÅ™Ø¹Ð¼Ø¹. á¸ | {Ä¢lÃ¢ÃÃ®Ã¢TÅ

Note the last byte, \xC5, which shows up as an A with ring above in (2). It is not displayed at all in the Unicode version (1), which is correct I guess. But its presence somehow disrupts my routine.

And this is my decode routine:

sub unicode_decode
{
  my $string = shift;
  utf8::decode($string);
  $string =~ s/([^a-zA-Z0-9])/'&#'.unpack('U0U*',$1).';'/eg;

  return($string);
}

Whenever I remove the last byte \xC5 from the name above, the regex transforms the name perfectly (it looks like 1). But if I don't remove it, the decoding does nothing and the name looks like (2). The routine still converts the name into HTML codes, but it treats each byte as a separate character instead of interpreting the string as Unicode.

I think that trailing \xC5 is probably not valid from a Unicode perspective, either in combination with what precedes it or standing alone, but I can't control that. So what can I do about it?
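
A quick way to check that suspicion is to look at the return value of utf8::decode, which reports false and leaves the string untouched when the bytes are not well-formed UTF-8. The following is only a minimal sketch; is_valid_utf8 is a hypothetical helper name, not part of my script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the byte string is well-formed UTF-8,
# 0 otherwise. utf8::decode works in place, so we test a copy.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return utf8::decode($bytes) ? 1 : 0;
}

print is_valid_utf8("\x54\xC5\x99"), "\n";   # "T" followed by a complete two-byte sequence
print is_valid_utf8("\x54\xC5"), "\n";       # "T" followed by a truncated two-byte sequence
```

If the second call reports 0, the lone \xC5 is a lead byte with no continuation byte after it, which would explain why the whole name is left undecoded.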

Edit:
I have now tried the Encode module and its decode() routine, which gives better results and does not break my regex:
xTřعмع. Ḡ| {ĢlâÐîâT�

The last character has the decimal value 65533 (U+FFFD, the replacement character). I suppose I can live with a box behind the names of people who are unable to create correct Unicode names :-)
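
For reference, here is a rough sketch of how the routine could look with Encode. It assumes the input is meant to be UTF-8 and relies on decode()'s default behaviour of substituting U+FFFD for malformed bytes. The name unicode_to_entities and the optional $strip flag are hypothetical additions for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical variant of unicode_decode(): decodes UTF-8 bytes and turns
# every character outside [a-zA-Z0-9] into a decimal &#NNN; code.
sub unicode_to_entities {
    my ($bytes, $strip) = @_;

    # With the default CHECK value, malformed bytes become U+FFFD
    # instead of making the whole decode fail.
    my $chars = decode('UTF-8', $bytes);

    # Optional: drop the replacement characters instead of showing a box.
    $chars =~ s/\x{FFFD}//g if $strip;

    # ord() gives the code point of the single matched character.
    $chars =~ s/([^a-zA-Z0-9])/'&#' . ord($1) . ';'/ge;

    return $chars;
}

print unicode_to_entities("\x78\x54\xC5\x99"), "\n";
print unicode_to_entities("\x54\xC5", 1), "\n";
```

Passing a true second argument removes the trailing box entirely; leaving it out keeps the &#65533; marker so that broken names remain visible as such.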

Thanks,
Forlix

In reply to Unicode to HTML code &#....; by Forlix
