#!/usr/bin/perl
use strict;
use Unicode::Normalize;

use open IN => ':encoding(utf8)';
binmode STDIN, ':encoding(utf8)';

while (<>) {
    $_ = NFKC( $_ );
    # now do whatever you want...
}

That applies the ':encoding(utf8)' layer to all input files (including STDIN), so any method of reading input from any file handle will complain if the data can't be interpreted as utf8. Once the data is read in, you just apply the normalization and do whatever else you need to do.
You weren't very specific about what you mean by "validate" (or about what you want to do with invalid data). Note that the code above doesn't actually die on invalid input; it just prints a warning and does the best it can with what it gets.
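If you do want the read itself to die on malformed input rather than just warn, here is a minimal sketch of one way to do it (my own addition, not from the original post): set PerlIO::encoding's fallback to FB_CROAK before pushing the :encoding layer, so the layer croaks instead of warning.

```perl
#!/usr/bin/perl
# Sketch, assuming you want invalid utf8 to be fatal at read time:
# $PerlIO::encoding::fallback controls how the :encoding layer handles
# malformed data; FB_CROAK makes it die instead of warn-and-substitute.
use strict;
use warnings;
use Encode qw(FB_CROAK);
use PerlIO::encoding;
use Unicode::Normalize;

local $PerlIO::encoding::fallback = FB_CROAK;
binmode STDIN,  ':encoding(utf8)';
binmode STDOUT, ':encoding(utf8)';

while (<STDIN>) {
    $_ = NFKC( $_ );    # never reached for a line of invalid utf8
    print;
}
```

The trade-off is that you get a fatal error at the first bad byte sequence, with no chance to report anything more useful than the layer's own message.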
Personally, when I really want to know whether a file is valid utf8 (and I want to provide useful diagnostics when it isn't), I tend to read it as raw data and then do

    eval "Encode::decode('utf8', \$_, FB_CROAK)";

so I can trap non-utf8 input and give a proper error report.
Did I misunderstand the question? I don't think there'll be any significant speed-up by trying to do block-oriented input; input gets buffered anyway. (But the Encode man page does explain how to handle input that isn't "character oriented", if you really want to do that.)
UPDATE: ikegami's reply to my post is very helpful (++!), so definitely follow his advice over mine.
In reply to Re: unicode normalization layer
by graff
in thread unicode normalization layer
by DrWhy