Re^4: ASCII encoded unicode strings on web, such as \u00F3

Torture? String eval happens every time you use a module.

Only if the module was not loaded before. Modules that have already been loaded are not evaluated again, see require.
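A minimal sketch of that behaviour, assuming nothing beyond the core module List::Util (chosen only as an example): require records every file it has loaded in %INC, and a second require of the same module finds that entry and does not evaluate the file again.

```perl
use strict;
use warnings;

# First require: the file is located, compiled and executed.
require List::Util;

# require records the loaded file in %INC under its relative path.
my $loaded_from = $INC{'List/Util.pm'};

# Second require: the %INC entry already exists, so nothing is evaluated again.
require List::Util;

print "List::Util was loaded once, from $loaded_from\n";
```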

s/\\u(\w{4})/eval "\"\\x{$1}\""/ge has one string eval per match. In a non-English text, that may be one eval for every few words. The German language is quite harmless: umlauts are quite rare, and the sharp s (ß) suffers from the new spelling rules that prefer ss. But other languages tend to decorate Latin letters (the ASCII stuff) with all kinds of hooks, dots, slashes. And with messages written in non-Latin alphabets (Cyrillic, Greek), words are composed entirely of \uXXXX, so you end up with one string eval for every single letter of the message.

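s/\\u([0-9a-fA-F]{4})/chr hex $1/ge also treats the replacement part as an expression, but that expression is compiled only once, at compile time, together with the surrounding code. At run time, each match just executes the already compiled chr hex $1.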
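Wrapped into a small function, that substitution could look like the following sketch. The function name decode_u_escapes and the sample string are made up for illustration, and the code assumes that every escape has exactly four hex digits.

```perl
use strict;
use warnings;

# Decode \uXXXX escapes (exactly four hex digits) without any string eval.
sub decode_u_escapes {
    my ($text) = @_;
    # /e: the replacement is Perl code, compiled once with the rest of the file.
    $text =~ s/\\u([0-9a-fA-F]{4})/chr hex $1/ge;
    return $text;
}

# Single quotes keep the backslash, so this really contains the six
# characters \u00F3.
my $decoded = decode_u_escapes('Le\u00F3n');

printf "%d characters, third one is U+%04X\n",
    length($decoded), ord substr($decoded, 2, 1);
```

The stricter character class also rejects things like \uzzzz, which \w{4} would happily pass on.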

There still is a trap: The \uXXXX notation is limited to 16 bits = 65536 characters, but Unicode is larger. It depends on the encoder how characters needing 17 or more bits are represented.

It would be wise to use the UTF-16 scheme, i.e. surrogates, i.e. two \uXXXX sequences to encode one of those characters. If the encoder uses surrogates, the Perl code has to handle them accordingly. Encode::Unicode looks promising, but s///g could be sufficient (find surrogate pairs, calculate the replacement character from the two surrogate codes according to the surrogate rules).
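The s///g idea could be sketched like this, assuming the encoder emits proper UTF-16 surrogate pairs. The function name decode_u_escapes_utf16 is invented for this example, and lone (unpaired) surrogates get no special treatment here.

```perl
use strict;
use warnings;

sub decode_u_escapes_utf16 {
    my ($text) = @_;

    # First pass: a high surrogate (D800..DBFF) directly followed by a low
    # surrogate (DC00..DFFF) is combined into one character above U+FFFF.
    $text =~ s{\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})}
              {chr(0x10000 + ((hex($1) - 0xD800) << 10) + (hex($2) - 0xDC00))}ge;

    # Second pass: everything that is left is a plain 16-bit code point.
    # Note: an unpaired surrogate is simply turned into chr() of its value.
    $text =~ s/\\u([0-9a-fA-F]{4})/chr hex $1/ge;

    return $text;
}

# \uD83D\uDE00 is the surrogate pair for U+1F600.
my $smiley = decode_u_escapes_utf16('\uD83D\uDE00');
printf "U+%X\n", ord $smiley;
```

The order of the two passes matters: running the four-digit substitution first would convert both halves of a pair separately and leave two surrogate characters in the string.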

Another way could be to simply use more hex digits, perhaps by accident, so \u would be followed by five or six hex digits. If those are mixed with the four-digit variant, it is impossible to decode the text without heuristics: What does "\u101112" represent? chr(0x1011).'12', chr(0x10111).'2' or chr(0x101112)?
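To make the ambiguity concrete, here is a small sketch that decodes the same input with three hypothetical decoders that differ only in the number of hex digits they expect. The hash %reading and the loop are purely illustrative.

```perl
use strict;
use warnings;

my $input = '\u101112';

# One reading per assumed escape width (4, 5 or 6 hex digits).
my %reading;
for my $digits (4, 5, 6) {
    (my $copy = $input) =~ s/\\u([0-9a-fA-F]{$digits})/chr hex $1/e;
    $reading{$digits} = $copy;
}

for my $digits (sort keys %reading) {
    printf "%d digits: %d character(s), first is U+%X\n",
        $digits, length $reading{$digits}, ord $reading{$digits};
}
```

All three decoders succeed, giving strings of three, two and one character respectively, and nothing in the input tells you which one the encoder meant.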

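By the way: s/\\u(\w{4})/"\"\\x{$1}\""/gee should be equivalent to s/\\u(\w{4})/eval "\"\\x{$1}\""/ge, according to perlop: with /ee, the replacement is first evaluated as an expression, and the resulting string is then eval'ed. Still, I would prefer the explicit eval over /ee, because /ee looks too much like a typo.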
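A quick way to check that equivalence is to run both forms on the same input. In this sketch the /ee replacement is spelled as a double-quoted string expression, so that the first evaluation yields the text "\x{00F3}" (including the quotes) and the second evaluation turns that into the character. The variable names are made up.

```perl
use strict;
use warnings;

my $input = 'Le\u00F3n';

# Explicit string eval inside a /e replacement.
(my $with_eval = $input) =~ s/\\u(\w{4})/eval "\"\\x{$1}\""/ge;

# /ee: evaluate the replacement as an expression, then eval the result.
(my $with_ee = $input) =~ s/\\u(\w{4})/"\"\\x{$1}\""/gee;

print $with_eval eq $with_ee ? "same result\n" : "different results\n";
```

Both variants still pay for one string eval per match, so this only compares spelling, not speed.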

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
