Forlix has asked for the wisdom of the Perl Monks concerning the following question:

Hey everyone!

I have a script that queries a game server for player names over UDP, parses the packet, and puts info about the server on a web page. The player names can and often do contain Unicode characters, so I wrote a decode function that also converts the Unicode into HTML character references, since some browsers seem buggy with raw Unicode characters. The player names are displayed in the "Tahoma" font on the web page, as this closely matches the in-game font. This works pretty well, but there is one guy whose name simply won't get decoded by my routine.
I extracted his name in byte representation:

$pname = "\x78\x54\xC5\x99\xD8\xB9\xD0\xBC\xD8\xB9\x2E\x20\xE1\xB8\xA0\x7C\x20\x7B\xC4\xA2\x6C\xC3\xA2\xC3\x90\xC3\xAE\xC3\xA2\x54\xC5";


1) Name in plain unicode:
xTřعмع. Ḡ| {ĢlâÐîâT

2) And now the name as it appears without unicode decoding:
xTřعмع. Ḡ| {Ä¢lâÃîâTÅ

As you can see, the last byte (the A with ring above) is not displayed at all in the Unicode version, which is correct I guess. But its presence disrupts my routine somehow.

And this is my decode routine:
sub unicode_decode {
    my $string = shift;
    utf8::decode($string);
    $string =~ s/([^a-zA-Z0-9])/'&#'.unpack('U0U*',$1).';'/eg;
    return($string);
}

Now whenever I remove the last byte \xC5 from the name above, the regex transforms the name perfectly (it looks like 1). But if I don't remove it, the decoding step does nothing and the name looks like (2). Well, the regex does convert the name into HTML codes, but it does not interpret it as Unicode.

I think that trailing \xC5 is probably not allowed on its own, or after the byte it follows, from a Unicode perspective. But I can't control that, so what can I do about it?

Edit:
I now tried the Encode module and its decode() routine, which gives better results, and does not break my regex:
xTřعмع. Ḡ| {ĢlâÐîâT�

The last character has decimal value 65533 (U+FFFD, the replacement character). I suppose I can live with a box behind the name of people who are unable to create correct Unicode names :-)
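
A minimal sketch of the behaviour described in the edit above, assuming Encode's default CHECK value (nothing passed as the third argument). The helper name decode_lenient is hypothetical and only serves the illustration. It lists the code points of the decoded string so a substituted U+FFFD is easy to spot:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 bytes and let Encode substitute
# U+FFFD for any malformed sequence (the default CHECK behaviour).
sub decode_lenient {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

# "T" followed by a lone lead byte, like the end of the name above.
my $decoded = decode_lenient("\x54\xC5");

# Print one code point per line; the output is plain ASCII.
printf "U+%04X\n", ord($_) for split //, $decoded;
```

The second line printed is U+FFFD, which is the 65533 mentioned above.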

Thanks,
Forlix

Re: Unicode to HTML code &#....;
by ikegami (Patriarch) on Nov 15, 2008 at 19:23 UTC

    First, let's clear up some confusion. Unicode doesn't specify how characters are stored, so you can't possibly be talking about Unicode when you're talking about a string of bytes. It looks like you meant UTF-8 when you said Unicode. UTF-8 is a means of representing (encoding) Unicode characters in bytes.

    $string =~ s/([^a-zA-Z0-9])/'&#'.unpack('U0U*',$1).';'/eg;

    can also be written as

    use HTML::Entities qw( encode_entities );
    $string = encode_entities($string);

    and

    use Encode qw( encode );
    $string = encode('US-ASCII', $string, Encode::FB_HTMLCREF);

    No need to reinvent the wheel.

    If you use the latter, you can combine the decoding and encoding into one step.

    use Encode qw( from_to );

    sub unicode_decode {
        my $string = shift;
        from_to($string, 'UTF-8', 'US-ASCII', Encode::FB_HTMLCREF);
        return($string);
    }
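
For illustration, here is a sketch of the same decode-then-encode idea written as two explicit steps, with a sample input. The helper name utf8_to_ascii_html is hypothetical. Note that ASCII characters pass through unchanged, so separators such as comma or slash are not escaped by this approach:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: UTF-8 bytes in, ASCII with numeric character
# references out.
sub utf8_to_ascii_html {
    my ($bytes) = @_;
    # Malformed input becomes U+FFFD under the default CHECK.
    my $chars = decode('UTF-8', $bytes);
    # Anything outside US-ASCII is emitted as &#NNN; by FB_HTMLCREF.
    return encode('US-ASCII', $chars, Encode::FB_HTMLCREF());
}

print utf8_to_ascii_html("T\xC5\x99"), "\n";   # prints T&#345;
```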
      Thanks to both of you.
      The thing is, $string must not contain certain characters like comma and slash, since I use those as separators in my text files. That's what the regex also ensures, so I think it's still the best choice here given the circumstances.
      So I now go with
      use Encode qw(decode);

      sub unicode_decode {
          my $string = decode('utf8', shift, 0);
          $string =~ tr/\x{FFFD}/\x20/;
          $string =~ s/([^a-zA-Z0-9\_\+\-\.])/'&#'.unpack('U0U*',$1).';'/eg;
          return($string);
      }
      As you can see, this also swaps the replacement character with a space should there be one.
Re: Unicode to HTML code &#....;
by graff (Chancellor) on Nov 15, 2008 at 22:45 UTC
    I'm guessing that the player who wanted to use this name ran afoul of a length limit, which was apparently imposed by byte count rather than character count -- e.g. maximum name length was 31 bytes, and this just happened to fall in the middle of a two-byte utf8 character, causing the last byte to be uninterpretable as utf8. (Whoever is responsible for imposing the length limit should revisit the issue.)
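
As a rough sketch of what a character-aware limit could look like, here is the idea expressed in Perl. The game server is presumably not written in Perl, so this is only an illustration, and the helper name truncate_utf8 is hypothetical. It cuts at the byte limit and then drops any incomplete trailing sequence. Because FB_QUIET stops at the first malformed sequence, input that is already malformed earlier in the string is cut there too:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: cut a UTF-8 byte string to at most $max bytes
# without leaving a partial multi-byte sequence at the end.
sub truncate_utf8 {
    my ($bytes, $max) = @_;
    my $cut = substr($bytes, 0, $max);
    # FB_QUIET makes decode return what it has decoded so far when it
    # hits a malformed or incomplete sequence (it also modifies $cut).
    my $chars = decode('UTF-8', $cut, Encode::FB_QUIET());
    return encode('UTF-8', $chars);
}
```

With a limit of 2 bytes, the input "T\xC5\x99" comes back as just "T" instead of "T\xC5".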

    I think your method (in your later reply) of using a space to replace each "\x{FFFD}" (the unicode replacement character, which is inserted whenever there is an "uninterpretable" byte sequence) is as good as any, though maybe the "ellipsis" character ("\x{2026}" or "\x{22ef}") would be more appropriate.

    IMHO, anyone who goes to the trouble of creating a "name" that contains both Latin-based (left-to-right) and Arabic-based (right-to-left) characters in a single word token is most likely trying to make trouble, and should expect (presumably wants) to see things go wrong.

      You're right, it is a limit, presumably for the names being stored in a 32 byte string with null-termination.

      But you can't possibly suggest that those people are looking for trouble, as most are merely kids trying to appear "cool" with a fancy name, and many don't even know what Unicode or UTF-8 is. They simply gather some nice looking characters from a character map and assemble a name as if they were playing with LEGO bricks.

      For anyone who wants to see the script in action, you can find it on http://forlix.org/ (the table on the bottom right). I have also added some whitespace treatment, so multiple spaces won't be collapsed (the CSS solution white-space:pre isn't yet supported well enough).
Re: Unicode to HTML code &#....;
by JavaFan (Canon) on Nov 15, 2008 at 18:49 UTC
    I cannot reproduce that:
    use 5.010;  # enables say

    my $n = "\x78\x54\xC5\x99\xD8\xB9\xD0\xBC\xD8\xB9\x2E\x20\xE1\xB8\xA0\x7C\x20\x7B\xC4\xA2\x6C\xC3\xA2\xC3\x90\xC3\xAE\xC3\xA2\x54\xC5";
    my $m = $n;
    chop $m;

    say 'Decode error with \xC5' unless utf8::decode ($n);
    say 'Decode error without \xC5' unless utf8::decode ($m);
    say "With \\xC5: $n";
    say "Without \\xC5: $m";
    __END__
    Decode error with \xC5
    With \xC5: xTřعмع. Ḡ| {ĢlâÐîâTÅ
    Without \xC5: xTřعмع. Ḡ| {ĢlâÐîâT

    Note that with the trailing \xC5, utf8::decode detects an error.
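
One way to act on that return value, sketched under the assumption that a U+FFFD substitution is acceptable for bad input: try utf8::decode first and fall back to Encode::decode when it reports failure. The helper name decode_or_replace is hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: use the in-place utf8::decode when the input is
# well-formed, otherwise fall back to Encode::decode, which substitutes
# U+FFFD for malformed sequences.
sub decode_or_replace {
    my ($bytes) = @_;
    my $copy = $bytes;
    # True return value: $copy now holds decoded characters.
    return $copy if utf8::decode($copy);
    # False return value: decode the original bytes leniently instead.
    return Encode::decode('UTF-8', $bytes);
}
```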