in reply to Behaviour of Encode::decode_utf8 on ASCII

Note that the docs do not specify which encodings this should hold for. This is arguably a bug, but it's probably a bug in the docs; I would expect decode('utf8', $string) to just flag its input as utf8, since that's by far the most efficient way of "decoding" UTF-8.

It shouldn't really matter anyway: ASCII is a subset of UTF-8, so an ASCII string is already valid UTF-8.

Update: it does the same for me. perl 5.8.8, Encode 2.12.


Re^2: Behaviour of Encode::decode_utf8 on ASCII
by jbert (Priest) on Feb 14, 2007 at 19:46 UTC
    There's also this in the docs:
    CAVEAT: When you run $string = decode("utf8", $octets), then $string may not be equal to $octets. Though they both contain the same data, the utf8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines). See "The UTF-8 flag" below.
    and the difference does matter from a performance point of view.
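
    You can watch the flag directly with Encode::is_utf8; a minimal sketch (the literal "hello" is arbitrary):

    use Encode qw(decode is_utf8);
    my $octets = "hello";                    # pure ASCII bytes
    my $string = decode( 'utf8', $octets );
    # per the caveat the flag should stay off for pure-ASCII input,
    # though (as this thread shows) some Encode versions set it anyway
    print is_utf8($string) ? "utf8 flag on\n" : "utf8 flag off\n";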

    UTF-8-tagged values in perl are contagious: concatenating a tagged value with an untagged one results in a tagged value (all well and good). But the regex engine is slower on unicode strings than on byte strings.
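
    A minimal illustration of that contagion (the snowman literal is just a convenient wide character):

    use Encode qw(is_utf8);
    my $bytes  = "plain ascii";          # utf8 flag off
    my $wide   = "snowman: \x{2603}";    # char > 255, so utf8 flag on
    my $joined = $bytes . $wide;         # concatenation upgrades the result
    print is_utf8($joined) ? "flagged\n" : "unflagged\n";   # prints "flagged"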

    Basically with this change in behaviour you can lose performance in a utf8-aware-and-correct application which has the vast majority of its inputs in ASCII, since the previously uncommon case of handling unicode strings is now the 100% case.

    This isn't theoretical: I'm fighting a significant CPU cost increase, which adds up over many servers.
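
    If you want to measure that cost on your own perl, a rough Benchmark sketch (the corpus is made up):

    use Benchmark qw(cmpthese);
    my $bytes = "The quick brown fox jumps over the lazy dog. " x 200;
    my $flagged = $bytes;
    utf8::upgrade($flagged);             # same text, utf8 flag on
    cmpthese( -1, {
        bytes   => sub { my $n = () = $bytes   =~ /\w+/g },
        flagged => sub { my $n = () = $flagged =~ /\w+/g },
    } );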

      I understand your concern, but I'm still trying to understand why the OP's question comes up in the context of your app.

      Are you getting actual utf8 data from a file handle that does not use the ":utf8" PerlIO layer (so that perl begins by assuming it's just a raw byte stream)? And if that's the case, are you trying to work out a way to use "byte-semantics" regexen where possible, and "character-semantics" only when necessary?
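
      (For reference, the layer approach looks like this; a minimal sketch, with 'input.txt' as a placeholder:)

      open my $fh, '<:utf8', 'input.txt' or die "open: $!";
      while ( my $line = <$fh> ) {
          # $line arrives as a character string; no separate decode() needed
          print length($line), "\n";
      }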

      If that's your situation, here's an easy, low-cpu-load method to check whether a raw byte string needs to be tagged as utf8:

      if ( length($string) > ($string =~ tr/\x00-\x7f//) ) { $string = decode( 'utf8', $string ); }
      (updated as per fenLisesi's reply -- thanks!)

      Or, given that the original string is not tagged as a perl-internal utf8 scalar value (utf8 flag is off), this might be just as good or better:

      if ( $string =~ /[\x80-\xff]/ ) { $string = decode( 'utf8', $string ); }
      I'm not actually sure whether one way is faster than the other, or whether the relative speed depends on your data; length() and tr/// are both pretty fast whereas a regex match is slower, but tr/// always processes the whole string, whereas the regex match stops at the first non-ASCII byte.
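
      If it matters for your data, a quick Benchmark sketch (the sample strings are made up) can settle it:

      use Benchmark qw(cmpthese);
      my $pure  = "a" x 10_000;               # all ASCII: both checks scan everything
      my $mixed = "\xe9" . ( "a" x 9_999 );   # non-ASCII byte right at the start
      for my $s ( $pure, $mixed ) {
          cmpthese( -1, {
              tr_count => sub { length($s) > ( $s =~ tr/\x00-\x7f// ) },
              regex    => sub { $s =~ /[\x80-\xff]/ },
          } );
      }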
        Thanks for this and yes, this is the approach we are currently trying.

        The data in this case comes from a db, via a db layer that isn't capable of automatically tagging data as utf8 (and from sundry other places as well).

        But simply replacing calls to decode with calls to a wrapper function which does this check adds an additional perl function call and regex per db call, which has its own (small) overhead.
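
        The wrapper itself is tiny; a minimal sketch (the name maybe_decode_utf8 is made up):

        use Encode qw(decode);
        sub maybe_decode_utf8 {
            my ($octets) = @_;
            return $octets unless $octets =~ /[\x80-\xff]/;   # pure ASCII: skip the decode
            return decode( 'utf8', $octets );
        }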

        It's doable, and is the best way forward I can see at the moment, but I was hoping for an "ah... that behaviour changed in version X.y; you can get the old behaviour by frobbing this magic flag" from a wise monk. (I couldn't see anything in Encode.pm and friends, but I didn't dig into the XS.)

        I'm also still very surprised that this behaviour has changed against the docs.

        A minor point: In the first sample code, I think you need to tell the transliterator that it should work on $string.