in reply to Re: Behaviour of Encode::decode_utf8 on ASCII
in thread Behaviour of Encode::decode_utf8 on ASCII

There's also this in the docs:
CAVEAT: When you run $string = decode("utf8", $octets), then $string may not be equal to $octets. Though they both contain the same data, the utf8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines). See "The UTF-8 flag" below.
and the difference does matter from a performance point of view.

UTF-8 tagged values in perl are contagious: concatenating a tagged value with an untagged one yields a tagged result (all well and good). But the regex engine is slower on unicode strings than on byte strings.

Basically, with this change in behaviour you can lose performance in a utf8-aware-and-correct application whose inputs are overwhelmingly ASCII, since the previously uncommon case of handling unicode strings now becomes the 100% case.
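The contagion described above is easy to observe with the core utf8::is_utf8() predicate; a minimal sketch (strings here are just illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $ascii = "hello";                        # plain byte string, flag off
my $wide  = decode('utf8', "caf\xc3\xa9");  # decoded string, flag on

print utf8::is_utf8($ascii) ? "on" : "off", "\n";   # off
my $joined = $ascii . $wide;    # concatenation upgrades the result
print utf8::is_utf8($joined) ? "on" : "off", "\n";  # on
```

Once one tagged value enters a pipeline of concatenations, everything downstream is tagged, which is why an all-ASCII workload can end up paying the unicode price throughout.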

This isn't theoretical; I'm fighting a significant CPU cost increase, which adds up across many servers.

Replies are listed 'Best First'.
Re^3: Behaviour of Encode::decode_utf8 on ASCII
by graff (Chancellor) on Feb 15, 2007 at 06:11 UTC
    I understand your concern, but I'm still trying to understand why the OP question comes up in the context of your app.

    Are you getting actual utf8 data from a file handle that does not use the ":utf8" PerlIO layer (so that perl begins by assuming it's just a raw byte stream)? And if that's the case, are you trying to work out a way to use "byte-semantics" regexen where possible, and "character-semantics" only when necessary?

    If that's your situation, here's an easy, low-cpu-load method to check whether a raw byte string needs to be tagged as utf8:

    if ( length($string) > $string =~ tr/\x00-\x7f// ) { $string = decode( 'utf8', $string ); }
    (updated as per fenLisesi's reply -- thanks!)

    Or, given that the original string is not tagged as a perl-internal utf8 scalar value (utf8 flag is off), this might be just as good or better:

    if ( $string =~ /[\x80-\xff]/ ) { $string = decode( 'utf8', $string ); }
    I'm not actually sure whether one way is faster than the other, or whether the relative speed depends on your data; "length()" and "tr///" are both pretty fast, whereas a regex match is slower, but tr/// always processes the whole string, while the regex match stops at the first non-ascii byte.
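    One way to settle the speed question empirically is the core Benchmark module; this sketch compares the two checks on an all-ASCII string (the string contents and iteration count are arbitrary, and results will vary with your data):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $string = "plain ascii data, no high bytes here. " x 50;

# tr/// in counting form does not modify $string; it just returns
# the number of bytes in the \x00-\x7f range.
cmpthese( 10_000, {
    tr_check    => sub { length($string) > ( $string =~ tr/\x00-\x7f// ) },
    regex_check => sub { $string =~ /[\x80-\xff]/ },
} );
```

    On strings with an early high byte the regex check should win (it stops at the first match); on long pure-ASCII strings the two are closer, since both must scan the whole string.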
      Thanks for this and yes, this is the approach we are currently trying.

      The data in this case is coming from a db with a db layer which isn't capable of automatically tagging data as utf8 (as well as sundry other places).

      But simply replacing calls to decode with calls to a wrapper function which does this check adds an additional perl function call and regex per db call, which has its own (small) overhead.
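      For reference, the wrapper being described might look something like this (maybe_decode_utf8 is a hypothetical name, not anything from the thread or from Encode):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical wrapper: decode only when high bytes are present,
# so pure-ASCII data keeps its utf8 flag off.
sub maybe_decode_utf8 {
    my ($octets) = @_;
    return $octets unless $octets =~ /[\x80-\xff]/;
    return decode( 'utf8', $octets );
}

my $plain = maybe_decode_utf8("column value");    # flag stays off
my $wide  = maybe_decode_utf8("caf\xc3\xa9");     # decoded, flag on
```

      The extra sub call per db value is exactly the overhead being complained about here, which is what motivates the inline idiom suggested in the reply below.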

      It's do-able, and is the best way forward I can see at the moment, but I was hoping for an "ah...that behaviour changed in version X.y, you can get the old behaviour by frobbing this magic flag", from a wise monk. (I couldn't see anything in Encode.pm and friends, but I didn't dig into the XS).

      I'm also still very surprised that this behaviour has changed against the docs.

        But simply replacing calls to decode with calls to a wrapper function which does this check adds an additional perl function call and regex per db call, which has its own (small) overhead.

        Hmm. Well, I suppose you could eliminate the extra function call, at least, by replacing each decode call with an idiom like this:

        $string =~ /[\x80-\xBF]/ and $string = decode( 'utf8', $string );
        Note that every valid utf8 wide character will always have one byte with a value in the limited range of 0x80-0xBF, so that's the simplest, smallest, quickest regex match you can get to test for wide characters. If there are none, the statement short-circuits -- no function call at all (not even to decode).

        (update: Actually, it's also true that every valid utf8 wide character must have a first byte that matches  /[\xC2-\xF7]/ which is a somewhat smaller range to check.)

        Even if decode worked the way that the (faulty) docs said, the use of this sort of short-circuit idiom might still be faster than calling decode on every string.
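        Applied at a (hypothetical) fetch loop, the short-circuit idiom looks like this; no wrapper sub, so the ASCII path costs one regex match and nothing else:

```perl
use strict;
use warnings;
use Encode qw(decode);

# $rows stands in for values fetched from the db layer (hypothetical data).
my $rows = [ "plain ascii", "caf\xc3\xa9", "more ascii" ];

for my $string (@$rows) {
    # foreach aliases $string to the array element, so this edits in place.
    # Pure-ASCII values short-circuit: decode() is never called for them.
    $string =~ /[\x80-\xBF]/ and $string = decode( 'utf8', $string );
}
```

        Matching on the continuation-byte range [\x80-\xBF] (or the lead-byte range mentioned in the update) rather than [\x80-\xff] is a micro-optimisation; any of these ranges distinguishes pure-ASCII strings from ones containing utf8 multi-byte sequences.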

        If that's still too slow and heavy for you, maybe you need to do some C/XS coding...

      A minor point: In the first sample code, I think you need to tell the transliterator that it should work on $string.