in reply to Re: Behaviour of Encode::decode_utf8 on ASCII
in thread Behaviour of Encode::decode_utf8 on ASCII

There's also this in the docs:
CAVEAT: When you run $string = decode("utf8", $octets), then $string may not be equal to $octets. Though they both contain the same data, the utf8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines). See "The UTF-8 flag" below.
and the difference does matter from a performance point of view.

UTF-8 tagged values in perl are contagious: concatenating a tagged value with an untagged one yields a tagged result (all well and good). But the regex engine is slower on unicode strings than on byte strings.

Basically, with this change in behaviour you can lose performance in a utf8-aware-and-correct application whose inputs are overwhelmingly ASCII, since the previously uncommon case of handling unicode strings now becomes the 100% case.
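The contagion described above is easy to observe with the core utf8::is_utf8() predicate; a minimal sketch (strings here are just illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $ascii = "hello";                        # plain byte string, flag off
my $wide  = decode('utf8', "caf\xc3\xa9");  # decoded string, flag on

print utf8::is_utf8($ascii) ? "on" : "off", "\n";   # off
my $joined = $ascii . $wide;    # concatenation upgrades the result
print utf8::is_utf8($joined) ? "on" : "off", "\n";  # on
```

Once one tagged value enters a pipeline of concatenations, everything downstream is tagged, which is why an all-ASCII workload can end up paying the unicode price throughout.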

This isn't theoretical; I'm fighting a significant CPU cost increase, which adds up across many servers.

Replies are listed 'Best First'.
Re^3: Behaviour of Encode::decode_utf8 on ASCII
by graff (Chancellor) on Feb 15, 2007 at 06:11 UTC
    I understand your concern, but I'm still trying to understand why the OP question comes up in the context of your app.

    Are you getting actual utf8 data from a file handle that does not use the ":utf8" PerlIO layer (so that perl begins by assuming it's just a raw byte stream)? And if that's the case, are you trying to work out a way to use "byte-semantics" regexen where possible, and "character-semantics" only when necessary?

    If that's your situation, here's an easy, low-cpu-load method to check whether a raw byte string needs to be tagged as utf8:

    if ( length($string) > $string =~ tr/\x00-\x7f// ) { $string = decode( 'utf8', $string ); }
    (updated as per fenLisesi's reply -- thanks!)

    Or, given that the original string is not tagged as a perl-internal utf8 scalar value (utf8 flag is off), this might be just as good or better:

    if ( $string =~ /[\x80-\xff]/ ) { $string = decode( 'utf8', $string ); }
    I'm not actually sure whether one way is faster than the other, or whether the relative speed depends on your data; "length()" and "tr///" are both pretty fast, whereas a regex match is slower, but tr/// always processes the whole string, while the regex match stops at the first non-ascii byte.
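    One way to settle the speed question empirically is the core Benchmark module; this sketch compares the two checks on an all-ASCII string (the string contents and iteration count are arbitrary, and results will vary with your data):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $string = "plain ascii data, no high bytes here. " x 50;

# tr/// in counting form does not modify $string; it just returns
# the number of bytes in the \x00-\x7f range.
cmpthese( 10_000, {
    tr_check    => sub { length($string) > ( $string =~ tr/\x00-\x7f// ) },
    regex_check => sub { $string =~ /[\x80-\xff]/ },
} );
```

    On strings with an early high byte the regex check should win (it stops at the first match); on long pure-ASCII strings the two are closer, since both must scan the whole string.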
      Thanks for this and yes, this is the approach we are currently trying.

      The data in this case is coming from a db with a db layer which isn't capable of automatically tagging data as utf8 (as well as sundry other places).

      But simply replacing calls to decode with calls to a wrapper function which does this check adds an additional perl function call and regex per db call, which has its own (small) overhead.
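      For reference, the wrapper being described might look something like this (maybe_decode_utf8 is a hypothetical name, not anything from the thread or from Encode):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical wrapper: decode only when high bytes are present,
# so pure-ASCII data keeps its utf8 flag off.
sub maybe_decode_utf8 {
    my ($octets) = @_;
    return $octets unless $octets =~ /[\x80-\xff]/;
    return decode( 'utf8', $octets );
}

my $plain = maybe_decode_utf8("column value");    # flag stays off
my $wide  = maybe_decode_utf8("caf\xc3\xa9");     # decoded, flag on
```

      The extra sub call per db value is exactly the overhead being complained about here, which is what motivates the inline idiom suggested in the reply below.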

      It's do-able, and is the best way forward I can see at the moment, but I was hoping for an "ah...that behaviour changed in version X.y, you can get the old behaviour by frobbing this magic flag", from a wise monk. (I couldn't see anything in Encode.pm and friends, but I didn't dig into the XS).

      I'm also still very surprised that this behaviour has changed against the docs.

        But simply replacing calls to decode with calls to a wrapper function which does this check adds an additional perl function call and regex per db call, which has its own (small) overhead.

        Hmm. Well, I suppose you could eliminate the extra function call, at least, by replacing each decode call with an idiom like this:

        $string =~ /[\x80-\xBF]/ and $string = decode( 'utf8', $string );
        Note that every valid utf8 wide character will always have one byte with a value in the limited range of 0x80-0xBF, so that's the simplest, smallest, quickest regex match you can get to test for wide characters. If there are none, the statement short-circuits -- no function call at all (not even to decode).

        (update: Actually, it's also true that every valid utf8 wide character must have a first byte that matches  /[\xC2-\xF7]/ which is a somewhat smaller range to check.)

        Even if decode worked the way that the (faulty) docs said, the use of this sort of short-circuit idiom might still be faster than calling decode on every string.
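        Applied at a (hypothetical) fetch loop, the short-circuit idiom looks like this; no wrapper sub, so the ASCII path costs one regex match and nothing else:

```perl
use strict;
use warnings;
use Encode qw(decode);

# $rows stands in for values fetched from the db layer (hypothetical data).
my $rows = [ "plain ascii", "caf\xc3\xa9", "more ascii" ];

for my $string (@$rows) {
    # foreach aliases $string to the array element, so this edits in place.
    # Pure-ASCII values short-circuit: decode() is never called for them.
    $string =~ /[\x80-\xBF]/ and $string = decode( 'utf8', $string );
}
```

        Matching on the continuation-byte range [\x80-\xBF] (or the lead-byte range mentioned in the update) rather than [\x80-\xff] is a micro-optimisation; any of these ranges distinguishes pure-ASCII strings from ones containing utf8 multi-byte sequences.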

        If that's still too slow and heavy for you, maybe you need to do some C/XS coding...

      A minor point: In the first sample code, I think you need to tell the transliterator that it should work on $string.