Re^2: unicode normalization layer

I believe

my $is_valid = utf8::decode($_);

is cheaper than

use Encode qw( decode FB_CROAK );
my $is_valid = eval '$_ = decode("utf-8", $_, FB_CROAK); 1';

It's definitely simpler (and you don't even need to load any modules!)
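To make the behavior of the first snippet concrete, here is a minimal sketch. The sample byte strings and variable names are invented for illustration; only utf8::decode itself comes from the post. It shows that the call converts the string in place and that its return value is a success flag, not the decoded text.

```perl
use strict;
use warnings;

# Hypothetical sample inputs: raw bytes as they might arrive from a file.
my $good = "caf\xC3\xA9";   # "cafe" with e-acute, encoded as UTF-8 (5 bytes)
my $bad  = "caf\xE9";       # same word in Latin-1; \xE9 opens a sequence that never completes

# utf8::decode converts in place and returns true only if the bytes were well-formed.
my $good_ok = utf8::decode($good) ? 1 : 0;
my $bad_ok  = utf8::decode($bad)  ? 1 : 0;

# The return value is a status flag, so the string "0" still counts as valid.
my $zero    = "0";
my $zero_ok = utf8::decode($zero) ? 1 : 0;

printf "good=%d (now %d characters), bad=%d, zero=%d\n",
    $good_ok, length($good), $bad_ok, $zero_ok;
```

No module has to be loaded for this; the utf8:: functions are available without "use utf8".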

Note that "utf8" is not the same thing as "utf-8". "utf8" is the name of Perl's internal encoding, which is more permissive than the "utf-8" standard. You definitely want to use "utf-8" when validating (if not always).
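The difference can be seen with a small experiment. This is only a sketch: the helper name accepts() and the test input are made up for the example, and the input is the byte pattern for 0x110000, one past the last Unicode code point, which I would expect the strict encoding to reject and the lax one to let through.

```perl
use strict;
use warnings;
use Encode qw( decode find_encoding FB_CROAK );

# The two names resolve to different encoding objects.
my $strict_name = find_encoding("utf-8")->name;   # the strict, standards-following one
my $lax_name    = find_encoding("utf8")->name;    # Perl's permissive variant

# Hypothetical test input: the byte pattern for code point 0x110000,
# which lies just beyond the Unicode range (0 .. 0x10FFFF).
my $bytes = "\xF4\x90\x80\x80";

# Hypothetical helper: returns 1 if decoding succeeds, 0 if decode() croaks.
sub accepts {
    my ($encoding, $octets) = @_;   # $octets is a copy, so decode() may clobber it
    return eval { decode($encoding, $octets, FB_CROAK); 1 } ? 1 : 0;
}

my $strict_ok = accepts("utf-8", $bytes);
my $lax_ok    = accepts("utf8",  $bytes);

print "$strict_name accepts it: $strict_ok\n";
print "$lax_name accepts it: $lax_ok\n";
```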

I also fixed the bug where decoding the string "0" would be considered a validation error.
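For readers who haven't seen this kind of bug: the earlier code isn't quoted here, so is_valid_naive below is only a guess at the general shape of the mistake (using the decoded text itself as the truth value). Both sub names are hypothetical. The fix is the trailing "1" inside the eval, so the result depends only on whether decode() croaked.

```perl
use strict;
use warnings;
use Encode qw( decode FB_CROAK );

# Hypothetical buggy version: the decoded text doubles as the verdict,
# so any input that decodes to a false value ("0" or "") looks invalid.
sub is_valid_naive {
    my ($octets) = @_;
    return eval { decode("utf-8", $octets, FB_CROAK) } ? 1 : 0;
}

# Fixed version: the eval block ends in 1, so it is true whenever
# decode() returns normally and undef only when decode() croaks.
sub is_valid_fixed {
    my ($octets) = @_;
    return eval { decode("utf-8", $octets, FB_CROAK); 1 } ? 1 : 0;
}

for my $case ( [ '"0"', "0" ], [ '"abc"', "abc" ], [ '"\xFF"', "\xFF" ] ) {
    my ($label, $input) = @$case;
    printf "%-7s naive=%d fixed=%d\n",
        $label, is_valid_naive($input), is_valid_fixed($input);
}
```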

Replies are listed 'Best First'.
Re^3: unicode normalization layer by DrWhy (Chaplain) on Sep 17, 2009 at 05:18 UTC
This is certainly the simplest approach I've seen so far, and I'll definitely keep it in mind for future use. However, I'm currently using something closer to graff's approach. I need to have a count of the invalid items encountered in the input stream, so I've defined a CHECK function to be used by :encoding(utf8) that ticks up a counter of the number of bad things found and then returns the Unicode WTF?! character to replace it in the input stream.

As for the relative speed of getline (<>) and read block, I was recently working with a system where benchmarking showed the speed difference between the two approaches was quite substantial -- 7-8 times difference -- which is why I wanted to avoid getline in this case, especially since my processing needs are not specifically line-oriented.

--DrWhy

"If God had meant for us to think for ourselves he would have given us brains. Oh, wait..."
