in reply to Re^2: Behaviour of Encode::decode_utf8 on ASCII
in thread Behaviour of Encode::decode_utf8 on ASCII

I understand your concern, but I'm still trying to understand why the OP question comes up in the context of your app.

Are you getting actual utf8 data from a file handle that does not use the ":utf8" PerlIO layer (so that perl begins by assuming it's just a raw byte stream)? And if that's the case, are you trying to work out a way to use "byte-semantics" regexen where possible, and "character-semantics" only when necessary?

If that's your situation, here's an easy, low-cpu-load method to check whether a raw byte string needs to be tagged as utf8:

if ( length($string) > $string =~ tr/\x00-\x7f// ) { $string = decode( 'utf8', $string ); }
(updated as per fenLisesi's reply -- thanks!)

Or, given that the original string is not tagged as a perl-internal utf8 scalar value (utf8 flag is off), this might be just as good or better:

if ( $string =~ /[\x80-\xff]/ ) { $string = decode( 'utf8', $string ); }
I'm not actually sure whether one way is faster than the other, or whether the relative speed would depend on your data. "length()" and "tr///" are both pretty fast, whereas a regex match is slower; on the other hand, tr always processes the whole string, whereas that regex match will stop at the first non-ascii byte.
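
If you want to measure this on your own data, here is a minimal sketch using the core Benchmark module. The helper names (needs_decode_tr, needs_decode_re) and the sample strings are made up for illustration; real results will depend on your perl build and on where the non-ascii bytes sit in your strings.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical helpers (names invented for this sketch): both return true
# when the raw byte string contains at least one byte outside ASCII.
sub needs_decode_tr {
    my ($bytes) = @_;
    # tr/// with an empty replacement list only counts matching bytes;
    # it does not modify the string.
    return length($bytes) > ( $bytes =~ tr/\x00-\x7f// );
}

sub needs_decode_re {
    my ($bytes) = @_;
    return $bytes =~ /[\x80-\xff]/ ? 1 : 0;
}

# Made-up sample data: pure ASCII, and the same text preceded by the two
# bytes 0xC3 0xA9 (the UTF-8 encoding of U+00E9).
my $ascii = 'x' x 1000;
my $mixed = "\xC3\xA9" . ( 'x' x 1000 );

# Run each check for at least one CPU second and print a comparison table.
cmpthese( -1, {
    tr_ascii => sub { needs_decode_tr($ascii) },
    re_ascii => sub { needs_decode_re($ascii) },
    tr_mixed => sub { needs_decode_tr($mixed) },
    re_mixed => sub { needs_decode_re($mixed) },
} );
```

Putting the non-ascii bytes at the start of the mixed sample is deliberate: that is the case where the regex can stop early while tr still has to walk the whole string.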

Re^4: Behaviour of Encode::decode_utf8 on ASCII
by jbert (Priest) on Feb 15, 2007 at 08:13 UTC
    Thanks for this and yes, this is the approach we are currently trying.

    The data in this case is coming from a db (as well as sundry other places), through a db layer which isn't capable of automatically tagging data as utf8.

    But simply replacing calls to decode with calls to a wrapper function which does this check adds an additional perl function call and regex per db call, which has its own (small) overhead.

    It's do-able, and is the best way forward I can see at the moment, but I was hoping for an "ah...that behaviour changed in version X.y, you can get the old behaviour by frobbing this magic flag", from a wise monk. (I couldn't see anything in Encode.pm and friends, but I didn't dig into the XS).

    I'm also still very surprised that this behaviour has changed against the docs.

      But simply replacing calls to decode with calls to a wrapper function which does this check adds an additional perl function call and regex per db call, which has its own (small) overhead.

      Hmm. Well, I suppose you could eliminate the extra function call, at least, by replacing each decode call with an idiom like this:

      $string =~ /[\x80-\xBF]/ and $string = decode( 'utf8', $string );
      Note that every valid utf8 wide character will always have one byte with a value in the limited range of 0x80-0xBF, so that's the simplest, smallest, quickest regex match you can get to test for wide characters. If there are none, the statement short-circuits -- no function call at all (not even to decode).

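      (update: Actually, it's also true that every valid utf8 wide character must have a first byte that matches /[\xC2-\xF7]/, which is a somewhat smaller range to check. Strict UTF-8 as defined in RFC 3629 stops at \xF4, so for strictly valid data the range is smaller still.)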
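
      The byte-range claims above can be checked with a short script. This is only a sketch: the helper name byte_range_checks and the list of code points are invented for illustration, and it uses Encode's strict 'UTF-8' encoding to produce the raw bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (name invented for this sketch): encode one code
# point as strict UTF-8 and report which of the two byte-range tests match.
sub byte_range_checks {
    my ($cp) = @_;
    my $bytes = encode( 'UTF-8', chr($cp) );    # raw bytes, utf8 flag off
    return (
        $bytes,
        ( $bytes =~ /[\x80-\xBF]/   ? 1 : 0 ),  # contains a continuation byte?
        ( $bytes =~ /\A[\xC2-\xF4]/ ? 1 : 0 ),  # starts with a multi-byte lead byte?
    );
}

# Sample code points: one ASCII character, then values at or near the
# boundaries of the 2-, 3- and 4-byte encodings.
for my $cp ( 0x41, 0x80, 0x7FF, 0x800, 0xFFFD, 0x10000, 0x10FFFD ) {
    my ( $bytes, $has_cont, $has_lead ) = byte_range_checks($cp);
    printf "U+%04X  bytes=%s  continuation=%d  lead=%d\n",
        $cp,
        join( ' ', map { sprintf '%02X', ord } split //, $bytes ),
        $has_cont, $has_lead;
}
```

      The lead-byte test here uses the strict upper bound \xF4, since U+10FFFF is the highest code point strict UTF-8 can encode. Either test only tells you that non-ascii bytes are present; it does not validate the data, which is still left to decode.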

      Even if decode worked the way that the (faulty) docs said, the use of this sort of short-circuit idiom might still be faster than calling decode on every string.

      If that's still too slow and heavy for you, maybe you need to do some C/XS coding...

        Yes, this is do-able, but the cost is then in maintenance burden (this effectively inlines the function at all the call sites). We can work around this in various ways; I was really asking if people agreed it was a bug (which they don't seem to).

        I disagree that the docs are faulty. The code used to work that way. There is a good reason for the code to work that way.

        The docs (correctly) liken the utf8 flag to the string/integer tag on a scalar.

        I think a change like this is similar to, say, changing all perl numbers greater than a certain size to be held as a string representation "for consistency". It would change the performance characteristics, but not the correctness (assuming the relevant routines exist to perform numeric operations on strings of digits). People who did numerical work would be upset, especially if the docs said "perl stores numbers in native format for speed".

        Thanks to everyone for their time and their comments. It's interesting no-one agrees with me that this is a problem. I'll take that as a sign to let it lie and we'll work around it as best we can.

Re^4: Behaviour of Encode::decode_utf8 on ASCII
by fenLisesi (Priest) on Feb 15, 2007 at 09:47 UTC
    A minor point: In the first sample code, I think you need to tell the transliterator that it should work on $string.