Detect the Charset of a file

endymion has asked for the wisdom of the Perl Monks concerning the following question:

Re: Detect the Charset of an file by Corion (Patriarch) on Oct 21, 2013 at 13:40 UTC
Have you looked at the `SYNOPSIS` section of Encode::Detect::Detector? It seems quite complete to me and suggests a way to obtain a good guess at the encoding of a bunch of bytes. What problems did you encounter when using the code from the documentation?
Re^2: Detect the Charset of an file by endymion (Acolyte) on Oct 21, 2013 at 13:52 UTC
Yes, I tried this code, but I don't get anything back. The problem is that our data files always contain some special characters. We already recode to utf8, but when the original file is already utf8 and we recode it anyway, we get very bad characters. I only need some code to detect whether the file is already utf8, so that we can skip the recode.
Re^3: Detect the Charset of an file by Corion (Patriarch) on Oct 21, 2013 at 13:59 UTC
`->getresult` is documented as: "Returns the name of the detected charset or undef if no charset has (yet) been decided upon." So if you "don't get anything back", this is most likely because the detector has not yet seen enough data to determine whether the input is utf8 or something else. My advice is to look at your specific input data and to remove the "special characters always in our data files", unless by these "special characters" you mean "text that we want to display". Once you have extracted the text in question, you can use some heuristics to find out whether it looks like (valid) utf8, and then avoid double-encoding ("Mojibake"). Finding out whether some random byte sequences are valid utf8 is most easily done by taking some non-utf8 strings, encoding them, and then dumping the bytes. If you then check your new data against those byte sequences, you can likely determine whether your input has already been encoded or not.
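To make the "encode and dump the bytes" suggestion concrete, here is a minimal sketch using only the core Encode module. The helper name `hex_bytes` and the sample character are assumptions chosen for illustration; the second encode imitates a recode step that treats data which is already UTF-8 as if it were Latin-1.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', unpack '(H2)*', $bytes;
}

my $text = "\x{FC}";    # U+00FC, LATIN SMALL LETTER U WITH DIAERESIS

# Encoded once: one character becomes two bytes.
my $once = encode( 'UTF-8', $text );

# Encoded twice: the UTF-8 bytes are misread as Latin-1 and encoded again.
my $twice = encode( 'UTF-8', decode( 'ISO-8859-1', $once ) );

print hex_bytes($once),  "\n";    # c3 bc
print hex_bytes($twice), "\n";    # c3 83 c2 bc
```

Byte pairs such as `c3 83` followed by `c2 xx` in your own data are a typical sign that the recode was applied to input that was already UTF-8.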
Re^4: Detect the Charset of an file by endymion (Acolyte) on Oct 21, 2013 at 14:14 UTC
Re^3: Detect the Charset of an file by graff (Chancellor) on Oct 23, 2013 at 05:21 UTC
You said: "I only need some code to detect if the file is already utf8, then we don't do the recode." The easiest way to check whether your data is utf8 is to read it as "raw" and try decoding it from utf8. If that succeeds, the data is clearly utf8. This works well because non-ASCII, non-utf8 data will virtually ALWAYS throw an error if you try to interpret it as utf8 data. `use Encode; open( my $fh, "<:raw", $filename ) or die "Cannot open $filename: $!"; local $/; $_ = <$fh>; eval { $_ = decode( 'utf8', $_, Encode::FB_CROAK ) }; if ( $@ ) { print "$filename is NOT UTF8\n"; } else { print "$filename IS UTF8\n"; }` Note that when given an ASCII file, the above will say "$filename IS UTF8", which of course is true. UPDATE: Just noticed a missing semi-colon at the end of the eval block -- fixed it.
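The same idea can be wrapped in reusable functions. This is only a sketch: the helper names `is_valid_utf8` and `is_ascii` are made up for illustration, and it makes two choices that differ from the snippet above. It uses the strict `'UTF-8'` encoding name rather than the more lenient `'utf8'`, and it passes `Encode::LEAVE_SRC` because decode may otherwise overwrite its input when a CHECK value is given.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $bytes decodes cleanly as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;    # work on a copy of the caller's data
    my $ok = eval {
        Encode::decode( 'UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC() );
        1;
    };
    return $ok ? 1 : 0;
}

# Hypothetical helper: true if $bytes contains only 7-bit ASCII.
# Such data is valid UTF-8 and valid Latin-1 alike, so recoding is harmless.
sub is_ascii {
    my ($bytes) = @_;
    return $bytes =~ /[^\x00-\x7F]/ ? 0 : 1;
}

# Demonstration on literal byte strings instead of a real file.
for my $sample ( "plain text", "\xC3\xBC", "\xFC" ) {
    printf "ascii=%d utf8=%d\n", is_ascii($sample), is_valid_utf8($sample);
}
```

For a real file, read the contents through a `<:raw` handle as in the snippet above and pass the resulting bytes to `is_valid_utf8`; only run the recode when it returns false.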
Re^4: Detect the Charset of an file by endymion (Acolyte) on Oct 23, 2013 at 06:48 UTC
Re^5: Detect the Charset of an file by graff (Chancellor) on Oct 23, 2013 at 21:00 UTC
Re^4: Detect the Charset of an file by Anonymous Monk on Oct 24, 2013 at 13:47 UTC
Re: Detect the Charset of an file by mtmcc (Hermit) on Oct 21, 2013 at 13:45 UTC
Encode::Guess might be worth a look, as may this previous posting: What encoding am I (probably) using?
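For reference, a small sketch of how Encode::Guess might be used for this. `guess_encoding` returns an encoding object on success and a plain error string otherwise; the wrapper name `guessed_name` and the sample bytes are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: name of the guessed encoding, or undef when
# guess_encoding() returns an error string instead of an object.
sub guessed_name {
    my ( $bytes, @suspects ) = @_;
    my $enc = guess_encoding( $bytes, @suspects );
    return ref($enc) ? $enc->name : undef;
}

my %samples = (
    'utf8 bytes'   => "\xC3\xBC",    # U+00FC encoded as UTF-8
    'latin1 bytes' => "\xFC",        # U+00FC encoded as Latin-1
);

for my $label ( sort keys %samples ) {
    my $name = guessed_name( $samples{$label} );
    print "$label: ", ( defined $name ? $name : 'no unambiguous guess' ), "\n";
}
```

Extra suspects can be passed as further arguments, for example `guessed_name( $bytes, 'latin1' )`. Be aware that every byte sequence is valid Latin-1, so with that suspect in play, data that is valid UTF-8 is likely to be reported as ambiguous rather than as utf8.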
Re^2: Detect the Charset of an file by endymion (Acolyte) on Oct 22, 2013 at 07:12 UTC
Hello again, I tried this code (`$file` is my text file): `$d = new Encode::Detect::Detector; $d->handle($file); $charset = $d->getresult; print "".$charset."\n";` I'm getting: `Undefined subroutine &main::handel called at .....`
Re^3: Detect the Charset of an file by Anonymous Monk on Oct 22, 2013 at 07:17 UTC
The string "handel" doesn't appear in the code you posted. In other words, what you posted is not a copy/paste of the code you actually ran, which contains a typo.
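Two things are worth separating here. The error message names `&main::handel`, which suggests the real script called a misspelled `handel` as a plain function rather than the `handle` method. Independently of the typo, the posted code passes `$file` to the detector, and the detector works on the bytes themselves, not on a file name. The following is a hedged sketch of feeding it file contents instead. The helper names `slurp_raw` and `guess_charset` are made up, and because Encode::Detect is not a core module, the sketch returns nothing when it is not installed.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as raw bytes, with no decoding layer.
sub slurp_raw {
    my ($path) = @_;
    open( my $fh, '<:raw', $path ) or die "Cannot open $path: $!";
    local $/;    # slurp mode: read everything in one go
    my $bytes = <$fh>;
    close $fh;
    return defined $bytes ? $bytes : '';
}

# Hypothetical helper: best-effort charset guess for a byte string.
# Returns undef if the module is missing or the detector cannot decide.
sub guess_charset {
    my ($bytes) = @_;
    return unless eval { require Encode::Detect::Detector; 1 };
    return Encode::Detect::Detector::detect($bytes);
}

# Usage, assuming a file name was given on the command line.
if (@ARGV) {
    my $charset = guess_charset( slurp_raw( $ARGV[0] ) );
    print defined $charset ? "$charset\n" : "charset could not be determined\n";
}
```

As noted earlier in the thread, the detector can legitimately return undef on short or ambiguous input, so the caller has to handle that case.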
Re^4: Detect the Charset of an file by endymion (Acolyte) on Oct 22, 2013 at 07:24 UTC
Re^5: Detect the Charset of an file by Anonymous Monk on Oct 22, 2013 at 07:38 UTC
Re^4: Detect the Charset of an file by endymion (Acolyte) on Oct 22, 2013 at 08:10 UTC
Re^5: Detect the Charset of an file by Anonymous Monk on Oct 22, 2013 at 08:17 UTC
Re^4: Detect the Charset of an file by endymion (Acolyte) on Oct 22, 2013 at 08:32 UTC
Re^5: Detect the Charset of an file by Anonymous Monk on Oct 22, 2013 at 08:38 UTC