endymion has asked for the wisdom of the Perl Monks concerning the following question:

Hello friends, I have a little problem to solve. I receive files in different charsets and need to determine the encoding of each file. I tried Encode::Detect::Detector, but I don't really know how to use it. For example, I want to detect the charset of file1 and get the result back in a variable, so that the variable really contains e.g. "utf8". Any ideas?

Replies are listed 'Best First'.
Re: Detect the Charset of an file
by Corion (Patriarch) on Oct 21, 2013 at 13:40 UTC

    Have you looked at the SYNOPSIS section in Encode::Detect::Detector? It seems quite complete to me and suggests a way to obtain a good guess at the encoding of a bunch of bytes. What problems did you encounter when using the code from the documentation?

      Yes, I tried this code, but I don't get anything back. The problem is we have some special characters always in our data files. We already have a recode to utf8, but strangely, when the original file is already utf8 and we recode it anyway, we get very bad characters. I only need some code to detect if the file is already utf8, then we don't do the recode.

        ->getresult is documented as

        Returns the name of the detected charset or undef if no charset has (yet) been decided upon.

        So if you "don't get anything back", this is most likely because the detector has not yet seen enough data to determine whether the input is utf8 or something else.
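A minimal sketch of feeding a file's contents to the detector, under a few assumptions: `guess_charset` is a hypothetical helper name, the 4096-byte chunk size is arbitrary, and the module is loaded defensively because it is not part of core Perl. Two points matter here: `handle` expects the bytes of the file rather than its name, and calling `eod` after the last chunk tells the detector that no more data is coming, which may let it settle on a result it would otherwise leave undecided.

```perl
use strict;
use warnings;

# Guess the charset of a file with Encode::Detect::Detector.
# Returns the charset name, or undef if the module is not installed
# or the detector could not decide.
sub guess_charset {
    my ($filename) = @_;

    # Not a core module, so load it at run time and bail out if missing.
    return undef unless eval { require Encode::Detect::Detector; 1 };

    open( my $fh, '<:raw', $filename ) or die "Cannot open $filename: $!";
    my $detector = Encode::Detect::Detector->new;

    # Feed the file's bytes (not its name) to the detector in chunks.
    while ( read( $fh, my $chunk, 4096 ) ) {
        $detector->handle($chunk);
    }
    close $fh;

    $detector->eod;    # no more data: lets the detector make a final call
    return $detector->getresult;
}
```

Called as `my $charset = guess_charset('file1');`, this leaves either a charset name or undef in `$charset`, so the caller still has to handle the undecided case.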

        My advice is to look at your specific input data, and to remove the "special characters always in our data files", unless by these "special characters" you mean "text that we want to display". Once you have extracted the text in question, you can use some heuristics to find out if it looks like (valid) utf8, and then avoid double-encoding ("Mojibake").

        Finding out if some random byte sequences are valid utf8 is most easily done by taking some non-utf8 strings, encoding them and then dumping the bytes. If you then check your new data against those byte sequences, you can likely determine whether your input has already been encoded or not.
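The byte-dumping idea can be sketched with the core Encode module. The sample string "café", the choice of ISO-8859-1 as the legacy charset, and the helper name `hex_bytes` are all assumptions for illustration; they do not come from the original data files.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $text = "caf\x{E9}";                      # "café" as a character string

my $latin1 = encode( 'ISO-8859-1', $text );  # one byte per character
my $utf8   = encode( 'UTF-8',      $text );  # the "é" becomes two bytes

# Double encoding: UTF-8 bytes misread as Latin-1, then encoded again.
my $double = encode( 'UTF-8', decode( 'ISO-8859-1', $utf8 ) );

print 'Latin-1:      ', hex_bytes($latin1), "\n";
print 'UTF-8:        ', hex_bytes($utf8),   "\n";
print 'Double UTF-8: ', hex_bytes($double), "\n";
```

The three dumps are `63 61 66 E9`, `63 61 66 C3 A9` and `63 61 66 C3 83 C2 A9`. A `C3 83` or `C3 82`/`C2 xx` pattern in incoming data is therefore a strong hint that it was encoded twice.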

        You said:

        I only need some code to detect if the file is already utf8, then we don't do the recode.

        The easiest way to check whether your data is utf8 is to read it as "raw" and try decoding it from utf8. If that succeeds, the data is almost certainly utf8. This works well because non-ASCII, non-utf8 data will virtually ALWAYS throw an error if you try to interpret it as utf8 data.

        use Encode;

        open( my $fh, "<:raw", $filename ) or die "Cannot open $filename: $!";
        local $/;
        $_ = <$fh>;

        # 'UTF-8' is the strict decoder; 'utf8' is Perl's more lenient variant.
        eval { $_ = decode( 'UTF-8', $_, Encode::FB_CROAK ) };
        if ( $@ ) {
            print "$filename is NOT UTF8\n";
        }
        else {
            print "$filename IS UTF8\n";
        }

        Note that when given an ASCII file, the above will say "$filename IS UTF8", which of course is true.

        UPDATE: Just noticed a missing semi-colon at the end of the eval block -- fixed it.
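For the stated goal of skipping the recode when the file is already utf8, the same check can be wrapped in a function. This is only a sketch: `ensure_utf8` is a hypothetical name, and the ISO-8859-1 fallback is an assumption, since the thread never says which charset the non-utf8 files use.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return UTF-8 bytes for $bytes, recoding from $fallback only when
# $bytes is not already valid UTF-8.
sub ensure_utf8 {
    my ( $bytes, $fallback ) = @_;
    $fallback //= 'ISO-8859-1';    # assumed legacy charset

    # Decode a copy: with a CHECK argument, decode() may modify its input.
    my $copy = $bytes;
    my $ok = eval { decode( 'UTF-8', $copy, Encode::FB_CROAK ); 1 };
    return $bytes if $ok;          # already UTF-8 (or plain ASCII)

    # Not valid UTF-8: treat the bytes as the fallback charset and recode.
    return encode( 'UTF-8', decode( $fallback, $bytes ) );
}
```

Valid UTF-8 input passes through unchanged, so running the function twice on the same data does not double-encode it.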

Re: Detect the Charset of an file
by mtmcc (Hermit) on Oct 21, 2013 at 13:45 UTC
      Hello again, I tried the code:

      $d = new Encode::Detect::Detector;
      $d -> handle($file);
      ($file is my textfile)
      $charset = $d->getresult;
      print "*".$charset."*\n";

      I'm getting: Undefined subroutine &main::handel called at .....
        The string "handel" doesn't appear in the code you posted. In other words, you didn't copy/paste the code you actually ran, so its typos aren't visible here.