
Torture? String eval happens every time you use a module.

Only if the module was not loaded before. Modules that have already been loaded are not evaluated again, see require.
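To make the "loaded once" point concrete, here is a minimal sketch. Demo::Once is a made-up module name, served from memory by an @INC hook that counts how often its source is actually handed to perl; nothing here is from the original thread.

```perl
use strict;
use warnings;

my $loads = 0;

# Hypothetical @INC hook: serves the source of the invented module
# Demo::Once from memory and counts how often perl asks for it.
unshift @INC, sub {
    my ($self, $file) = @_;
    return unless $file eq 'Demo/Once.pm';
    $loads++;
    my $source = "package Demo::Once; 1;\n";
    open my $fh, '<', \$source or die "in-memory open failed: $!";
    return $fh;
};

# Three requires, but only the first one compiles the module.
# After that, the entry in %INC short-circuits the load.
require Demo::Once for 1 .. 3;

print "source requested $loads time(s)\n";
```

The second and third require find 'Demo/Once.pm' in %INC and return immediately, so the hook is not called again.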

s/\\u(\w{4})/eval "\"\\x{$1}\""/ge has one string eval per match. In a non-English text, that may be one eval for every few words. The German language is quite harmless: umlauts are quite rare, and the sharp s (ß) suffers from the new spelling rules that prefer ss. But other languages tend to decorate Latin letters (the ASCII stuff) with all kinds of hooks, dots, and slashes. And in messages written in non-Latin alphabets (Cyrillic, Greek), words are composed entirely of \uXXXX, so you end up with one string eval for every single letter of the message.

s/\\u([0-9a-fA-F]{4})/chr hex $1/ge also treats the replacement part as an expression, but that expression is compiled only once, at compile time.
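A minimal sketch of that substitution wrapped in a sub. The name decode_u4 and the sample string are mine, purely for illustration. Note that the [0-9a-fA-F] class only accepts hex digits, whereas \w{4} would also match things like \uzzzz.

```perl
use strict;
use warnings;

# Hypothetical helper: decode four-digit \uXXXX escapes without any string eval.
sub decode_u4 {
    my ($text) = @_;
    $text =~ s/\\u([0-9a-fA-F]{4})/chr hex $1/ge;
    return $text;
}

# Single quotes keep the backslashes literal, as they would arrive from the web.
my $decoded = decode_u4('\u0416\u0443\u043A');

# Print code points instead of raw characters to avoid output encoding issues.
print join(' ', map { sprintf('U+%04X', ord($_)) } split(//, $decoded)), "\n";
# prints: U+0416 U+0443 U+043A
```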

There still is a trap: The \uXXXX notation is limited to 16 bits = 65536 characters, but Unicode is larger. It depends on the encoder how characters needing 17 or more bits are represented.

It would be wise to use the UTF-16 scheme, i.e. surrogates: two \uXXXX sequences encode one of those characters. If the encoder uses surrogates, the Perl code has to handle them accordingly. Encode::Unicode looks promising, but s///g could be sufficient (find surrogate pairs, then calculate the replacement character from the two surrogate code units according to the surrogate rules).
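A sketch of the s///g approach, assuming the encoder really emits UTF-16 surrogate pairs. The sub name decode_u16 is invented. Unpaired surrogates are not validated here; they fall through to the plain four-digit branch.

```perl
use strict;
use warnings;

# Hypothetical helper: decode \uXXXX escapes, combining UTF-16 surrogate pairs.
sub decode_u16 {
    my ($text) = @_;
    $text =~ s{
        \\u([dD][89abAB][0-9a-fA-F]{2})   # capture 1: high surrogate, D800-DBFF
        \\u([dD][c-fC-F][0-9a-fA-F]{2})   # capture 2: low surrogate, DC00-DFFF
      |
        \\u([0-9a-fA-F]{4})               # capture 3: any other four-digit escape
    }{
        defined($3)
            ? chr(hex($3))
            : chr(0x10000 + ((hex($1) - 0xD800) << 10) + (hex($2) - 0xDC00))
    }gex;
    return $text;
}

# \uD83D\uDE00 is the surrogate pair for U+1F600.
printf "U+%04X\n", ord(decode_u16('\uD83D\uDE00'));   # prints: U+1F600
```

Doing both cases in one pass means a surrogate pair is always tried before the plain four-digit branch, so a pair is never decoded as two separate characters.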

Another way could be to simply use more hex digits, perhaps by accident, so \u would be followed by five or six hex digits. If those are mixed with the four-digit variant, it is impossible to decode the text without heuristics: What does "\u101112" represent? chr(0x1011).'12', chr(0x10111).'2' or chr(0x101112)?
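To illustrate the ambiguity, here is a small sketch that decodes the same input under each of the three assumptions. decode_fixed is a hypothetical helper; the digit count is simply passed in, since nothing in the data tells you which one is right.

```perl
use strict;
use warnings;

# Hypothetical helper: decode the first \u escape, assuming exactly $digits hex digits.
sub decode_fixed {
    my ($text, $digits) = @_;
    $text =~ s/\\u([0-9a-fA-F]{$digits})/chr hex $1/e;
    return $text;
}

for my $digits (4, 5, 6) {
    my $out = decode_fixed('\u101112', $digits);
    printf "%d digits: U+%04X followed by '%s'\n",
        $digits, ord($out), substr($out, 1);
}
# prints:
# 4 digits: U+1011 followed by '12'
# 5 digits: U+10111 followed by '2'
# 6 digits: U+101112 followed by ''
```

All three results are valid code points, so the decoder has no way to tell which one the encoder meant.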

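By the way: according to perlop, s/\\u(\w{4})/"\"\\x{$1}\""/gee should be equivalent to s/\\u(\w{4})/eval "\"\\x{$1}\""/ge, because /ee first evaluates the replacement as an expression and then evals the resulting string. Still, I would prefer the explicit eval over /ee, because /ee looks too much like a typo.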

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

In reply to Re^4: ASCII encoded unicode strings on web, such as \u00F3 by afoken
in thread ASCII encoded unicode strings on web, such as \u00F3 by igoryonya
