Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks

I have to import text files into a SQLite database. The files are usually UTF-8 encoded, but I cannot rule out that one arrives in another encoding. I would like to check whether each file is UTF-8 and, if it is not, discard it with an error message. I have the script below, but something does not seem to work. For now I check every line, although it might be better to check the file as a whole.

#!/usr/bin/perl
use warnings;
use strict;
use Encode;
use Encode::Guess;

open (DATA, "<:utf8", "a.txt") or die $!;
binmode DATA, ":utf8";
my $line = <DATA>;
while($line){
    my $decoder = guess_encoding($line);
    if (ref($decoder) eq 'Encode::utf8'){
        print "File is in UTF-8\n";
        #doing something
    }
    $line = <DATA>;
}
__END__

Replies are listed 'Best First'.
Re: Guessing encode text file
by McA (Priest) on Jan 20, 2014 at 15:44 UTC

    Hi,

    Without knowing exactly how Encode::Guess works, I'm pretty sure you are mixing up two things. To guess the encoding, you have to give it the raw byte stream. By putting the :utf8 layer on the input stream, you force Perl to interpret the bytes as UTF-8 encoded characters, which is exactly what you are trying to find out by guessing.

    So, open the file in binary mode, slurp the whole file if it's feasible and try guessing the encoding.

    Best regards
    McA
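
A minimal sketch of the advice above, assuming the file is small enough to slurp. The helper name file_is_utf8 is hypothetical, and treating an empty file or a pure-ASCII guess as acceptable is an assumption of this sketch (ASCII is a subset of UTF-8). It relies only on the default suspects of Encode::Guess.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# Hypothetical helper: true if Encode::Guess takes the raw bytes of
# $path for UTF-8 (or plain ASCII, which is a subset of UTF-8).
sub file_is_utf8 {
    my ($path) = @_;

    # Read raw bytes: no :utf8 layer, so nothing is decoded beforehand.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };    # slurp the whole file
    close $fh;

    # Assumption of this sketch: an empty file counts as acceptable.
    return 1 unless defined $bytes && length $bytes;

    # On success guess_encoding returns an encoding object,
    # on failure a plain error string.
    my $decoder = guess_encoding($bytes);
    return 0 unless ref $decoder;

    return $decoder->name =~ /\A(?:utf8|ascii)\z/ ? 1 : 0;
}

# Usage: perl check.pl a.txt b.txt
for my $path (@ARGV) {
    print "$path: ", (file_is_utf8($path) ? "looks like UTF-8" : "not UTF-8"), "\n";
}
```

Keep in mind that this is a heuristic guess, not a validation of the whole file.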

Re: Guessing encode text file
by aitap (Curate) on Jan 20, 2014 at 19:18 UTC
    Wouldn't it be handier to use the error fallback options of the decode function from the standard Encode module? For example, you can make it throw an exception when the text cannot be decoded as UTF-8, and catch that exception:
    my $characters = eval { decode utf8 => $bytes, Encode::FB_CROAK };
    unless (defined $characters) {
        warn "$filename does not contain valid UTF-8 data, skipping";
        next FILE;
    }
    # INSERT INTO ...
    (untested)
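
A self-contained variant of the snippet above might look like this. The helper name decode_utf8_or_undef is hypothetical. The sketch also swaps in the encoding name 'UTF-8' (with hyphen), which Encode treats as the strict variant, where the snippet uses the laxer 'utf8'; that substitution is a choice made for this sketch, not part of the reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded character string,
# or undef if $bytes is not well-formed UTF-8.
sub decode_utf8_or_undef {
    my ($bytes) = @_;    # a copy, so the caller's buffer stays untouched

    # FB_CROAK makes decode die on malformed input; eval turns that
    # exception into an undef return value.
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
}

# Usage: perl import.pl a.txt b.txt
FILE: for my $filename (@ARGV) {
    open my $fh, '<:raw', $filename or die "Cannot open $filename: $!";
    my $bytes = do { local $/; <$fh> };    # slurp raw bytes
    close $fh;

    my $characters = decode_utf8_or_undef($bytes);
    unless (defined $characters) {
        warn "$filename does not contain valid UTF-8 data, skipping\n";
        next FILE;
    }
    # $characters now holds decoded text, ready for the database insert.
}
```

Unlike guessing, this decodes the whole input, so a malformed byte sequence anywhere in the file causes it to be rejected.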