in reply to unicode normalization layer

If you don't have time to dive into PerlIO::via, would the following suffice (and if not, why not)?
    #!/usr/bin/perl
    use strict;
    use Unicode::Normalize;
    use open IN => ':encoding(utf8)';
    binmode STDIN, ':encoding(utf8)';
    while (<>) {
        $_ = NFKC( $_ );  # now do whatever you want...
    }
That applies the ':encoding(utf8)' layer to all input files (including STDIN), so any method of reading from any input file handle will complain if the data can't be interpreted as utf8. Once it's read in, you just apply normalization and do whatever else you need to do.

You weren't very specific about what you mean by "validate" (or about what you want to do with invalid data). Note that the above doesn't actually die on invalid input; it just prints a warning and does the best it can with what it gets.

Personally, when I really want to know whether a file is valid utf8 (and I want to provide useful diagnostics when it isn't), I tend to read it as raw data and then do

    use Encode qw( decode FB_CROAK );
    eval { decode( 'utf8', $_, FB_CROAK ) };
so I can trap non-utf8 input (by checking $@) and give a proper error report.
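For what it's worth, here is a runnable sketch of that raw-read-then-decode approach; the per-line error reporting and variable names are my own embellishment, not graff's actual code:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode FB_CROAK );

my $line_no = 0;
while ( my $raw = <> ) {    # no :encoding layer -- we get raw octets
    $line_no++;
    my $ok = eval { $raw = decode( 'utf-8', $raw, FB_CROAK ); 1 };
    unless ($ok) {
        warn "line $line_no of $ARGV is not valid UTF-8: $@";
        next;
    }
    # $raw now holds decoded characters...
}
```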

Did I misunderstand the question? I don't think there'll be any significant speed-up by trying to do block-oriented input; input gets buffered anyway. (But the Encode man page does explain how to handle input that isn't "character oriented", if you really want to do that.)

UPDATE: ikegami's reply to my post is very helpful (++!), so definitely follow his advice over mine.

Re^2: unicode normalization layer
by ikegami (Patriarch) on Sep 16, 2009 at 14:41 UTC

    I believe

    my $is_valid = utf8::decode($_);
    is cheaper than
    use Encode qw( decode FB_CROAK );
    my $is_valid = eval { $_ = decode( "utf-8", $_, FB_CROAK ); 1 };

    It's definitely simpler (and you don't even need to load any modules!)

    Note that "utf8" is not the same thing as "utf-8": "utf8" is the name of Perl's lax internal encoding, while "utf-8" is the strict standard encoding. You definitely want "utf-8" when validating (if not always).

    I also fixed the bug where decoding the string "0" would be considered a validation error.
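A quick illustration of both points (the sample strings are my own, not from the posts above):

```perl
use strict;
use warnings;
use Encode qw( decode FB_CROAK );

# utf8::decode() decodes in place and returns true on success --
# no module needed, and no truthiness trap.
my $good = "caf\xC3\xA9";    # "café" as UTF-8 octets
my $bad  = "\xFF\xFE";       # not valid UTF-8
print utf8::decode($good) ? "good\n" : "bad\n";    # good
print utf8::decode($bad)  ? "good\n" : "bad\n";    # bad

# The "0" bug: using the decoded string itself as the success flag
# fails when the content is the single character "0", which is false
# in Perl. Appending "; 1" makes the eval return a reliable flag.
my $flag_wrong = eval { decode( 'utf-8', '0', FB_CROAK ) };       # "0" -- false!
my $flag_right = eval { decode( 'utf-8', '0', FB_CROAK ); 1 };    # 1  -- true
```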

      This is certainly the simplest approach I've seen so far, and I'll definitely keep it in mind for future use. However, I'm currently using something closer to graff's approach. I need a count of the invalid items encountered in the input stream, so I've defined a CHECK function for :encoding(utf8) that ticks up a counter of the bad things found and then returns the Unicode "WTF?!" character (U+FFFD, REPLACEMENT CHARACTER) to replace each one in the input stream.
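The poster's actual layer setup isn't shown; as a sketch of the same count-and-substitute idea using plain Encode, FB_QUIET stops at the first bad byte and leaves the remainder in the buffer, which makes counting straightforward:

```perl
use strict;
use warnings;
use Encode qw( decode FB_QUIET );

# Decode $octets, counting malformed bytes and substituting U+FFFD
# (REPLACEMENT CHARACTER) for each one.
my $octets    = "ok \xFF ok";    # sample input with one bad byte
my $bad_count = 0;
my $chars     = '';
while ( length $octets ) {
    # FB_QUIET consumes the valid prefix, leaving the rest in $octets
    $chars .= decode( 'utf-8', $octets, FB_QUIET );
    if ( length $octets ) {      # decode stopped at a malformed byte
        $bad_count++;
        $chars .= "\x{FFFD}";
        substr( $octets, 0, 1, '' );    # skip the offending byte
    }
}
print "$bad_count bad byte(s)\n";       # 1 bad byte(s)
```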

      As for the relative speed of getline (<>) and block reads, I was recently working with a system where benchmarking showed a substantial speed difference between the two approaches -- a factor of 7-8 -- which is why I wanted to avoid getline in this case, especially since my processing needs are not line-oriented.
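For the record, the chunked-decoding technique the Encode man page describes can be sketched like this; the block size and the handling of leftover bytes are my own choices:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode FB_QUIET );

# Block-oriented UTF-8 decoding: FB_QUIET leaves a trailing partial
# character in $buffer, so the next read completes it.
binmode STDIN;    # raw octets, no :encoding layer
my $buffer = '';
while ( read( STDIN, $buffer, 65536, length $buffer ) ) {
    my $chars = decode( 'utf-8', $buffer, FB_QUIET );
    # A genuinely malformed byte mid-stream stays in $buffer; a real
    # program would skip it, e.g. substr( $buffer, 0, 1, '' ).
    # ... process $chars (beware: combining sequences can straddle
    # block boundaries, so per-block normalization isn't always safe)
}
warn "trailing malformed or truncated bytes\n" if length $buffer;
```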

      --DrWhy

      "If God had meant for us to think for ourselves he would have given us brains. Oh, wait..."