in reply to Re: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'

Okay, here's what I came up with to test whether a file is valid UTF-8. I'm sure there's also some way to do this using a CPAN module.

sub file_is_valid_utf8 {
    my $f = shift;
    open(F, "<:raw", $f) or return 0;
    local $/;
    my $x = <F>;
    close F;
    return is_valid_utf8($x);
}

# What's passed to this routine has to be a stream of bytes, not a utf8
# string in which the characters are complete utf8 characters.
# That's why you typically want to call file_is_valid_utf8 rather than
# calling this directly.
sub is_valid_utf8 {
    my $x = shift;
    my $leading0     = '[\x{0}-\x{7f}]';
    my $leading10    = '[\x{80}-\x{bf}]';
    my $leading110   = '[\x{c0}-\x{df}]';
    my $leading1110  = '[\x{e0}-\x{ef}]';
    my $leading11110 = '[\x{f0}-\x{f7}]';
    my $utf8 = "($leading0|($leading110$leading10)|($leading1110$leading10$leading10)|($leading11110$leading10$leading10$leading10))*";
    return ($x=~/^$utf8$/);
}
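
On the "some way to do this using a CPAN module" point: a minimal sketch using Encode, which ships with Perl. The helper name is_strict_utf8 is made up for illustration. Unlike the regex above, which only checks the lead-byte/continuation-byte structure, the strict 'UTF-8' encoding should also reject overlong forms such as "\xC0\x80".

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $bytes is strictly valid UTF-8.
sub is_strict_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps decode() from modifying $bytes.
    my $ok = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

for my $sample ("caf\xC3\xA9",    # well-formed two-byte sequence
                "caf\xE9",        # lone Latin-1 byte
                "\xC0\x80") {     # overlong encoding of NUL
    printf "%s: %s\n",
        join(' ', map { sprintf '%02X', ord } split //, $sample),
        is_strict_utf8($sample) ? 'valid' : 'invalid';
}
```

As with is_valid_utf8, this expects raw bytes, so read the file with the "<:raw" layer before passing the contents in.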

Re^3: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'
by Juerd (Abbot) on Feb 25, 2008 at 15:04 UTC

    If you have the raw byte string, the easiest way to see if it's valid UTF-8 is to decode it to a Unicode string. If that fails, it wasn't utf8 enough :)

    utf8::decode($string) or die "Input is not valid UTF-8";
    or
    utf8::decode(my $text = $binary) or die "Input is not valid UTF-8";
    If you leave out the "or die" clause, any invalid UTF-8 will just be seen as ISO-8859-1.

    Update: changed the examples as per ikegami's sound response.

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      That should be
      utf8::decode(my $text = $binary) or die "Input is not valid UTF-8";

      It works in-place.
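
      A small sketch of what "in-place" means here, using made-up sample strings: utf8::decode modifies its argument and returns true on success, so copying first keeps the raw bytes around. On failure it returns false and, as far as I know, leaves the string untouched.

      ```perl
      use strict;
      use warnings;

      my $binary = "na\xC3\xAFve";        # 6 bytes: UTF-8 encoding of a 5-character word
      my $text   = $binary;               # copy, so $binary keeps the raw bytes
      my $ok     = utf8::decode($text);   # decodes $text in place

      printf "ok=%d bytes=%d chars=%d\n",
          $ok ? 1 : 0, length($binary), length($text);

      my $bad    = "na\xEFve";            # Latin-1 byte, not valid UTF-8
      my $failed = !utf8::decode($bad);   # false return signals invalid input
      printf "failed=%d length=%d\n", $failed ? 1 : 0, length($bad);
      ```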