
I believe

my $is_valid = utf8::decode($_);
is cheaper than
use Encode qw( decode FB_CROAK ); my $is_valid = eval { $_ = decode("utf-8", $_, FB_CROAK); 1 };

It's definitely simpler (and you don't even need to load any modules!)
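
Since "cheaper" is only my belief, here is a rough way to measure it yourself. This is a minimal sketch using the core Benchmark module; the sample string and the sub names (via_utf8_decode, via_encode) are made up for illustration, and the numbers will depend on your perl, your Encode version and your data.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );
use Encode qw( decode FB_CROAK );

# Hypothetical sample input: a UTF-8 encoded byte string.
my $octets = "caf\xC3\xA9 " x 1000;

# Validate and decode in place with the built-in utf8:: function.
sub via_utf8_decode {
    my $s = $octets;    # work on a copy, because utf8::decode modifies its argument
    return utf8::decode($s);
}

# Validate and decode with Encode, dying on malformed input.
sub via_encode {
    my $s = $octets;
    return eval { $s = decode("utf-8", $s, FB_CROAK); 1 };
}

# Run each sub for about one CPU second and print a comparison table.
cmpthese(-1, {
    'utf8::decode'   => \&via_utf8_decode,
    'Encode::decode' => \&via_encode,
});
```

Both subs copy the input first, so the copy cost is the same on each side of the comparison.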

Note that "utf8" is not the same thing as "utf-8". In Encode, "utf8" names Perl's lax internal encoding, while "utf-8" names the strict, standard one. You definitely want to use "utf-8" when validating (if not always).
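
If you want to see that the two spellings really resolve to different codecs, you can ask Encode directly. A small sketch; the variable names are mine, and the names it prints are the ones the Encode documentation gives for these two spellings.

```perl
use strict;
use warnings;
use Encode qw( find_encoding );

# Ask Encode which codec each spelling resolves to.
my $strict = find_encoding("utf-8")->name;    # documented as 'utf-8-strict'
my $lax    = find_encoding("utf8")->name;     # documented as 'utf8'

print "utf-8 => $strict\n";
print "utf8  => $lax\n";
```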

I also fixed the bug where decoding the string "0" would be considered a validation error.
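
To show what that bug looks like, here is a small sketch. The variable names are invented for the demonstration; the point is only that eval returns the value of its last statement, and a decoded "0" is false.

```perl
use strict;
use warnings;
use Encode qw( decode FB_CROAK );

my $octets = "0";    # valid UTF-8, but false in boolean context

# Buggy form: eval returns the decoded string itself, and "0" is false.
my $copy1 = $octets;    # copy, since decode with a CHECK may modify its source
my $buggy = eval { decode("utf-8", $copy1, FB_CROAK) };

# Fixed form: the trailing 1 makes eval return true whenever decode did not die.
my $copy2 = $octets;
my $fixed = eval { decode("utf-8", $copy2, FB_CROAK); 1 };

print "buggy check says: ", ( $buggy ? "valid" : "invalid" ), "\n";
print "fixed check says: ", ( $fixed ? "valid" : "invalid" ), "\n";
```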

Re^3: unicode normalization layer
by DrWhy (Chaplain) on Sep 17, 2009 at 05:18 UTC
    This is certainly the simplest approach I've seen so far, and I'll definitely keep it in mind for future use. However, I'm currently using something closer to graff's approach. I need a count of the invalid items encountered in the input stream, so I've defined a CHECK function to be used by :encoding(utf8). It increments a counter for each bad item found and returns the Unicode WTF?! character to replace it in the input stream.

    As for the relative speed of getline (<>) and block reads, I was recently working with a system where benchmarking showed a substantial difference between the two approaches -- a factor of 7 to 8 -- which is why I wanted to avoid getline in this case, especially since my processing needs are not specifically line-oriented.

    --DrWhy

    "If God had meant for us to think for ourselves he would have given us brains. Oh, wait..."