in reply to Behaviour of Encode::decode_utf8 on ASCII

Note that the docs do not specify which encodings this should hold for. This is arguably a bug, but it's probably a bug in the docs; I would expect decode('utf8', $string) to just flag its input as utf8, since that's by far the most efficient way of "decoding" UTF-8.

It shouldn't really matter anyway: ASCII is a subset of UTF-8, so an ASCII string is already valid UTF-8.

Update: it does the same for me. perl 5.8.8, Encode 2.12.


Re^2: Behaviour of Encode::decode_utf8 on ASCII
by jbert (Priest) on Feb 14, 2007 at 19:46 UTC
    There's also this in the docs:
    CAVEAT: When you run $string = decode("utf8", $octets), then $string may not be equal to $octets. Though they both contain the same data, the utf8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines). See "The UTF-8 flag" below.
    and the difference does matter from a performance point of view.
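
    You can watch the flag directly with Encode::is_utf8; a minimal sketch (the literal "hello" is arbitrary):

    use Encode qw(decode is_utf8);
    my $octets = "hello";                    # pure ASCII bytes
    my $string = decode( 'utf8', $octets );
    # per the caveat the flag should stay off for pure-ASCII input,
    # though (as this thread shows) some Encode versions set it anyway
    print is_utf8($string) ? "utf8 flag on\n" : "utf8 flag off\n";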

    UTF-8-tagged values in perl are contagious: concatenating a tagged value with an untagged one results in a tagged value (all well and good). But the regex engine is slower on unicode strings than on byte strings.
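
    A minimal illustration of that contagion (the snowman literal is just a convenient wide character):

    use Encode qw(is_utf8);
    my $bytes  = "plain ascii";          # utf8 flag off
    my $wide   = "snowman: \x{2603}";    # char > 255, so utf8 flag on
    my $joined = $bytes . $wide;         # concatenation upgrades the result
    print is_utf8($joined) ? "flagged\n" : "unflagged\n";   # prints "flagged"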

    Basically with this change in behaviour you can lose performance in a utf8-aware-and-correct application which has the vast majority of its inputs in ASCII, since the previously uncommon case of handling unicode strings is now the 100% case.

    This isn't theoretical: I'm fighting a significant CPU cost increase, which adds up over many servers.
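
    If you want to measure that cost on your own perl, a rough Benchmark sketch (the corpus is made up):

    use Benchmark qw(cmpthese);
    my $bytes = "The quick brown fox jumps over the lazy dog. " x 200;
    my $flagged = $bytes;
    utf8::upgrade($flagged);             # same text, utf8 flag on
    cmpthese( -1, {
        bytes   => sub { my $n = () = $bytes   =~ /\w+/g },
        flagged => sub { my $n = () = $flagged =~ /\w+/g },
    } );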

      I understand your concern, but I'm still trying to understand why the OP's question comes up in the context of your app.

      Are you getting actual utf8 data from a file handle that does not use the ":utf8" PerlIO layer (so that perl begins by assuming it's just a raw byte stream)? And if that's the case, are you trying to work out a way to use "byte-semantics" regexen where possible, and "character-semantics" only when necessary?
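
      (For reference, the layer approach looks like this; a minimal sketch, with 'input.txt' as a placeholder:)

      open my $fh, '<:utf8', 'input.txt' or die "open: $!";
      while ( my $line = <$fh> ) {
          # $line arrives as a character string; no separate decode() needed
          print length($line), "\n";
      }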

      If that's your situation, here's an easy, low-cpu-load method to check whether a raw byte string needs to be tagged as utf8:

      if ( length($string) > ($string =~ tr/\x00-\x7f//) ) { $string = decode( 'utf8', $string ); }
      (updated as per fenLisesi's reply -- thanks!)

      Or, given that the original string is not tagged as a perl-internal utf8 scalar value (utf8 flag is off), this might be just as good or better:

      if ( $string =~ /[\x80-\xff]/ ) { $string = decode( 'utf8', $string ); }
      I'm not actually sure whether one way is faster than the other, or whether the relative speed depends on your data; length() and tr/// are both pretty fast whereas a regex match is slower, but tr/// always processes the whole string, whereas the regex match stops at the first non-ASCII byte.
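
      If it matters for your data, a quick Benchmark sketch (the sample strings are made up) can settle it:

      use Benchmark qw(cmpthese);
      my $pure  = "a" x 10_000;               # all ASCII: both checks scan everything
      my $mixed = "\xe9" . ( "a" x 9_999 );   # non-ASCII byte right at the start
      for my $s ( $pure, $mixed ) {
          cmpthese( -1, {
              tr_count => sub { length($s) > ( $s =~ tr/\x00-\x7f// ) },
              regex    => sub { $s =~ /[\x80-\xff]/ },
          } );
      }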
        Thanks for this and yes, this is the approach we are currently trying.

        The data in this case comes from a db, via a db layer that isn't capable of automatically tagging data as utf8 (and from sundry other places as well).

        But simply replacing calls to decode with calls to a wrapper function which does this check adds an additional perl function call and regex per db call, which has its own (small) overhead.
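
        The wrapper itself is tiny; a minimal sketch (the name maybe_decode_utf8 is made up):

        use Encode qw(decode);
        sub maybe_decode_utf8 {
            my ($octets) = @_;
            return $octets unless $octets =~ /[\x80-\xff]/;   # pure ASCII: skip the decode
            return decode( 'utf8', $octets );
        }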

        It's doable, and is the best way forward I can see at the moment, but I was hoping for an "ah... that behaviour changed in version X.y; you can get the old behaviour by frobbing this magic flag" from a wise monk. (I couldn't see anything in Encode.pm and friends, but I didn't dig into the XS.)

        I'm also still very surprised that this behaviour has changed against the docs.

        A minor point: In the first sample code, I think you need to tell the transliterator that it should work on $string.