in reply to Re^4: Behaviour of Encode::decode_utf8 on ASCII
in thread Behaviour of Encode::decode_utf8 on ASCII

But simply replacing calls to decode with calls to a wrapper function which does this check adds an additional perl function call and regex per db call, which has its own (small) overhead.

Hmm. Well, I suppose you could eliminate the extra function call, at least, by replacing each decode call with an idiom like this:

$string =~ /[\x80-\xBF]/ and $string = decode( 'utf8', $string );

Note that every valid utf8 wide character always contains at least one continuation byte, with a value in the limited range 0x80-0xBF, so that's about the simplest, smallest, quickest regex match you can get to test for wide characters. If there are none, the statement short-circuits -- no function call at all (not even to decode).

(update: Actually, it's also true that every valid utf8 wide character must have a first byte that matches /[\xC2-\xF7]/ (or /[\xC2-\xF4]/ if you only care about code points up to U+10FFFF), which is a somewhat smaller range to check.)
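
For what it's worth, here is a minimal sketch of that idiom in action. The helper name decode_if_wide and the sample strings are mine, purely for illustration; at real call sites you would inline the one-liner, since the whole point is to avoid the extra sub call.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, for illustration only: decode the byte string
# only if it contains a byte that can occur solely in a multi-byte sequence.
sub decode_if_wide {
    my ($bytes) = @_;
    # Continuation bytes (0x80-0xBF) appear in every multi-byte UTF-8 character.
    $bytes =~ /[\x80-\xBF]/ and $bytes = decode( 'utf8', $bytes );
    return $bytes;
}

my $ascii = decode_if_wide("plain ascii");    # regex fails, decode is never called
my $wide  = decode_if_wide("caf\xC3\xA9");    # regex matches, decode is called

# Print lengths only, to avoid "Wide character" warnings on output.
printf "%s: length %d\n", 'ascii', length $ascii;
printf "%s: length %d\n", 'wide',  length $wide;
```

The ASCII string comes back untouched, while the second one comes back as four characters instead of five bytes.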

Even if decode worked the way that the (faulty) docs said, the use of this sort of short-circuit idiom might still be faster than calling decode on every string.
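
If you want to measure rather than guess, a rough Benchmark sketch along these lines should do. The sample data (nine ASCII strings to one with a wide character) is an assumption on my part; substitute whatever mix your db really returns. I'm not quoting timings, since they will depend on your Perl and Encode versions.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Benchmark qw(cmpthese);

# Assumed test data: mostly-ASCII byte strings, roughly what a db might return.
my @rows = ( ("plain ascii value") x 9, "caf\xC3\xA9" );

# Baseline: call decode on every string, wide characters or not.
sub always_decode {
    my @out = map { my $copy = $_; decode( 'utf8', $copy ) } @rows;
    return \@out;
}

# Short-circuit idiom: only call decode when a continuation byte is present.
sub guarded_decode {
    my @out;
    for my $s (@rows) {
        my $copy = $s;
        $copy =~ /[\x80-\xBF]/ and $copy = decode( 'utf8', $copy );
        push @out, $copy;
    }
    return \@out;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, { always => \&always_decode, guarded => \&guarded_decode } );
```

Both subs should return strings that compare equal; the only differences are how often decode gets called and whether the pure-ASCII results carry the utf8 flag.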

If that's still too slow and heavy for you, maybe you need to do some C/XS coding...

Re^6: Behaviour of Encode::decode_utf8 on ASCII
by jbert (Priest) on Feb 15, 2007 at 12:57 UTC
    Yes, this is doable, but the cost is then a maintenance burden (it effectively inlines the function at all the call sites). We can work around this in various ways; I was really asking whether people agreed it was a bug (which they don't seem to).

    I disagree that the docs are faulty. The code used to work that way. There is a good reason for the code to work that way.

    The docs (correctly) liken the utf8 flag to the string/integer tag on a scalar.

    I think a change like this is similar to, say, changing all Perl numbers greater than a certain size to be held as a string representation "for consistency". It would change the performance characteristics, but not the correctness (assuming the relevant routines exist to perform numeric operations on strings of digits). People who did numerical work would be upset. Especially if the docs said "perl stores numbers in native format for speed".

    Thanks to everyone for their time and their comments. It's interesting no-one agrees with me that this is a problem. I'll take that as a sign to let it lie and we'll work around it as best we can.