#!/usr/bin/perl
use strict;
use Unicode::Normalize;

use open IN => ':encoding(utf8)';
binmode STDIN, ':encoding(utf8)';

while (<>) {
    $_ = NFKC( $_ );
    # now do whatever you want...
}

That applies the ':encoding(utf8)' layer to all input files (including STDIN), so any method of reading input from any file handle will complain if the data can't be interpreted as utf8. Once the data is read in, you just apply the normalization and do whatever else you need to do.
You weren't very specific about what you mean by "validate" (or about what you want to do with invalid data). Note that the code above doesn't actually die on invalid input; it just prints a warning and does the best it can with what it gets.
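If you do want the read itself to die on malformed input rather than just warn, here is a minimal sketch of one way to do it (my own addition, not from the original post): set PerlIO::encoding's fallback to FB_CROAK before pushing the :encoding layer, so the layer croaks instead of warning.

```perl
#!/usr/bin/perl
# Sketch, assuming you want invalid utf8 to be fatal at read time:
# $PerlIO::encoding::fallback controls how the :encoding layer handles
# malformed data; FB_CROAK makes it die instead of warn-and-substitute.
use strict;
use warnings;
use Encode qw(FB_CROAK);
use PerlIO::encoding;
use Unicode::Normalize;

local $PerlIO::encoding::fallback = FB_CROAK;
binmode STDIN,  ':encoding(utf8)';
binmode STDOUT, ':encoding(utf8)';

while (<STDIN>) {
    $_ = NFKC( $_ );    # never reached for a line of invalid utf8
    print;
}
```

The trade-off is that you get a fatal error at the first bad byte sequence, with no chance to report anything more useful than the layer's own message.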
Personally, when I really want to know whether a file is valid utf8 (and I want to provide useful diagnostics when it isn't), I tend to read it as raw data and then do

    eval "Encode::decode('utf8', \$_, FB_CROAK)";

so I can trap non-utf8 input and give a proper error report.
Did I misunderstand the question? I don't think there'll be any significant speed-up by trying to do block-oriented input; input gets buffered anyway. (But the Encode man page does explain how to handle input that isn't "character oriented", if you really want to do that.)
UPDATE: ikegami's reply to my post is very helpful (++!), so definitely follow his advice over mine.
In reply to Re: unicode normalization layer
by graff
in thread unicode normalization layer
by DrWhy