in reply to Faster utf8 escaping.

Another thing to bear in mind is that you can often get a long way with core modules that almost do what you want.

For example, you were using Unicode::Escape because it promised to conveniently turn non-ASCII characters into JavaScript escape sequences. And this made sense, because that's what you wanted to end up with. Did you look at the core Encode module, though? There's at least one way to use it to solve your problem, and in my tests -- using your test cases -- it comes out about 35% faster than your hand-rolled version, while also providing the extras Unicode::Escape offers, like handling non-UTF-8 encodings or invalid UTF-8.
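
The code that went with this post isn't reproduced here, so what follows is only a sketch of the kind of approach meant, assuming the input is a string of UTF-8 octets: let Encode turn every non-ASCII character into an XML hex character reference via Encode::FB_XMLCREF, then rewrite those references as \uXXXX. The name escape_non_ascii is made up for illustration, and the sketch assumes an Encode release in which from_to honors the CHECK argument.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: takes UTF-8 octets, returns ASCII with \uXXXX escapes.
sub escape_non_ascii {
    my ($octets) = @_;

    # Re-encode the copy in place as ASCII; characters that ASCII cannot
    # represent come out as &#x...; hex references.
    from_to($octets, 'UTF-8', 'ascii', Encode::FB_XMLCREF);

    # Rewrite each hex reference as a JavaScript escape, zero-padded to
    # four digits. Only code points up to U+FFFF are covered.
    $octets =~ s/&#x([0-9a-f]{1,4});/sprintf('\\u%04x', hex $1)/ieg;

    return $octets;
}

# "café" followed by a space and U+2660, given as UTF-8 octets.
print escape_non_ascii("caf\xc3\xa9 \xe2\x99\xa0"), "\n";   # prints: caf\u00e9 \u2660
```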

Re^2: Faster utf8 escaping.
by ikegami (Patriarch) on Apr 07, 2009 at 22:20 UTC

    First, it's important to note that Encode::FB_XMLCREF doesn't work on from_to when using 5.8.8's Encode 2.12.

    $ perl -MEncode=encode,from_to -wle'$s = encode("UTF-8", chr(0x2660)); from_to($s, "UTF-8", "ascii", Encode::FB_XMLCREF); print $s' | od -c
    0000000 342 231 240 \n
    0000004

    The problem appears to be fixed in 2.13. Using Encode 2.33:

    $ perl -MEncode=encode,from_to -wle'$s = encode("UTF-8", chr(0x2660)); from_to($s, "UTF-8", "ascii", Encode::FB_XMLCREF); print $s' | od -c
    0000000 & # x 2 6 6 0 ; \n
    0000011

    So it's not a solid solution to begin with. (You'd have to use decode+encode instead of from_to.) But it's buggy even with a bug-free version of Encode.

    1..1
    not ok 1 - With XML/HTML 4 digit hex entity
    #   Failed test 'With XML/HTML 4 digit hex entity'
    #   at a.pl line 24.
    #          got: '\u2660\u2660'
    #     expected: '&#x2660;\u2660'
    # Looks like you failed 1 test of 1.
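
A sketch of both points, assuming the approach under test is "encode with FB_XMLCREF, then rewrite the references" (the script a.pl itself isn't shown, and naive_escape is a name made up here). It uses decode plus encode rather than from_to, and shows that a reference already present in the input cannot be told apart from one that Encode generated:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the approach being tested.
sub naive_escape {
    my ($octets) = @_;

    my $chars = decode('UTF-8', $octets);                     # octets -> characters
    my $ascii = encode('ascii', $chars, Encode::FB_XMLCREF);  # non-ASCII -> &#x...;

    # Rewrite every four-digit hex reference, wherever it came from.
    $ascii =~ s/&#x([0-9a-f]{4});/\\u$1/ig;

    return $ascii;
}

# Input: a literal "&#x2660;" followed by an actual U+2660 character.
my $in = '&#x2660;' . encode('UTF-8', chr(0x2660));
print naive_escape($in), "\n";   # prints: \u2660\u2660
```

The literal reference in the input is rewritten along with the real character, which is the failure reported in the test output above.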
Re^2: Faster utf8 escaping.
by kyle (Abbot) on Apr 08, 2009 at 01:05 UTC

    Thank you! I like this solution. You're right, core modules can often get you most of the way to where you want to go.

    That said, it needed some modification to work with the larger battery of tests I have locally. The first thing I noticed is that Encode doesn't always produce a four-digit hex entity. I took care of that by allowing the regular expression to match two or four characters. Then came the string that encoded as "Mar&#xed;a; F", so I made the "characters" match only hex digits.

    Now the replacement looks like this:

    $s =~ s/&#x([a-f0-9]{2})?([a-f0-9]{2});/'\\u' . ($1||'00') . $2/ieg;

    That still doesn't take care of the problem that ikegami raises. I figure I can preprocess with s/&/&x/g and then undo that with s/&x/&/g when it's done (or something), but then we're back up to five full scans over the input string. It might still be faster, but I haven't tested that yet.
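
For what it's worth, the guard idea might look like the sketch below. It has not been run against the larger test battery mentioned above and has not been timed. escape_guarded is a hypothetical name, decode plus encode stands in for whatever Encode call is actually in use, and the rewrite step uses sprintf to zero-pad instead of the two-or-four-digit pattern, so that three-digit references such as &#x101; are covered as well.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: UTF-8 octets in, ASCII with \uXXXX escapes out,
# leaving any "&#x...;" text that was already in the input alone.
sub escape_guarded {
    my ($octets) = @_;
    my $s = decode('UTF-8', $octets);

    # Guard every ampersand already present in the input.
    $s =~ s/&/&x/g;

    # Non-ASCII characters become &#x...; references.
    $s = encode('ascii', $s, Encode::FB_XMLCREF);

    # Only references produced by Encode still start with "&#x" here.
    # Covers code points up to U+FFFF.
    $s =~ s/&#x([0-9a-f]{1,4});/sprintf('\\u%04x', hex $1)/ieg;

    # Remove the guard again.
    $s =~ s/&x/&/g;

    return $s;
}

my $in = '&#x2660;' . encode('UTF-8', chr(0x2660));
print escape_guarded($in), "\n";   # prints: &#x2660;\u2660
```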

    Thanks again.