Re: Detect the Charset of an file

Re^2: Detect the Charset of an file by endymion (Acolyte) on Oct 21, 2013 at 13:52 UTC
Yes, I tried this code, but I don't get anything back. The problem is that our data files always contain some special characters. We already recode them to utf8, but when the original file is already utf8 and we recode it again, we get badly garbled characters. I only need some code to detect whether the file is already utf8, so that we can skip the recode.
Re^3: Detect the Charset of an file by Corion (Patriarch) on Oct 21, 2013 at 13:59 UTC
`->getresult` is documented as: "Returns the name of the detected charset or undef if no charset has (yet) been decided upon." So if you "don't get anything back", this is most likely because the detector has not yet seen enough data to determine whether the input is utf8 or something else. My advice is to look at your specific input data and to remove the "special characters always in our data files", unless by these "special characters" you mean "text that we want to display". Once you have extracted the text in question, you can use some heuristics to find out whether it looks like (valid) utf8, and then avoid double-encoding ("Mojibake"). The easiest way to learn what such byte sequences look like is to take some non-utf8 strings, encode them, and dump the bytes. If you then check your new data against those byte sequences, you can likely determine whether your input has already been encoded or not.
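To make the byte-dump idea concrete, here is a minimal sketch. The sample character (U+00E9) and the helper name `hex_dump` are only illustrations, and ISO-8859-1 is an assumption about how already-encoded bytes get misread before the second encode.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Return the bytes of a string as space-separated uppercase hex.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# U+00E9 (e with acute accent) as a Perl character string.
my $char = "\x{E9}";

# Encoded once: the correct UTF-8 byte sequence.
my $once = encode('UTF-8', $char);

# Encoded twice: the UTF-8 bytes are misread as Latin-1 characters
# and encoded again, which is the usual source of mojibake.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

print 'once:  ', hex_dump($once),  "\n";   # C3 A9
print 'twice: ', hex_dump($twice), "\n";   # C3 83 C2 A9
```

If sequences like `C3 83 C2 A9` show up in your data where you expect a single accented letter, the text has probably been encoded twice.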
Re^4: Detect the Charset of an file by endymion (Acolyte) on Oct 21, 2013 at 14:14 UTC
I'll try this tomorrow. Maybe it works. It's a pity that it doesn't work when I pass the whole file $file as the argument to the module.
Re^3: Detect the Charset of an file by graff (Chancellor) on Oct 23, 2013 at 05:21 UTC
You said: "I only need some code to detect if the file is already utf8, then we don't do the recode." The easiest way to check whether your data is utf8 is to read it as "raw" and try decoding it from utf8. If that succeeds, the data is clearly utf8. The reason why this is a good solution is that non-ASCII, non-utf8 data will virtually ALWAYS throw an error if you try to interpret it as utf8 data.

`use Encode; open( my $fh, "<:raw", $filename ) or die; local $/; $_ = <$fh>; eval { $_ = decode( 'utf8', $_, Encode::FB_CROAK ) }; if ( $@ ) { print "$filename is NOT UTF8\n"; } else { print "$filename IS UTF8\n"; }`

Note that when given an ASCII file, the above will say "$filename IS UTF8", which of course is true. UPDATE: Just noticed a missing semi-colon at the end of the eval block -- fixed it.
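The same check can be wrapped so that the recode is skipped for files that are already utf8. This is only a sketch: the sub names are made up, ISO-8859-1 as the fallback is an assumption about the legacy encoding (replace it with whatever your files really use), and it uses the strict `'UTF-8'` encoding name, which Encode documents as more strict than `'utf8'`.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# True if the byte string is well-formed UTF-8 (plain ASCII also qualifies).
sub is_utf8_bytes {
    my ($bytes) = @_;
    # Decode a copy: with FB_CROAK, decode() may modify its input in place.
    my $copy = $bytes;
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Return UTF-8 bytes: unchanged if the input is already UTF-8, otherwise
# recoded from $fallback (an ASSUMED legacy encoding, ISO-8859-1 by default).
sub to_utf8_bytes {
    my ($bytes, $fallback) = @_;
    $fallback //= 'ISO-8859-1';
    return $bytes if is_utf8_bytes($bytes);
    return encode('UTF-8', decode($fallback, $bytes));
}
```

Read the file with `<:raw` as in the snippet above, pass the content through `to_utf8_bytes`, and a file that is already utf8 comes back untouched instead of being encoded a second time.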
Re^4: Detect the Charset of an file by endymion (Acolyte) on Oct 23, 2013 at 06:48 UTC
Hello graff, I tried your great code, but I ran into another bug. I'll now try calling `file -i` via system.
Re^5: Detect the Charset of an file by graff (Chancellor) on Oct 23, 2013 at 21:00 UTC
Re^4: Detect the Charset of an file by Anonymous Monk on Oct 24, 2013 at 13:47 UTC
No problem. I noticed this myself and fixed it in the script. Maybe your great script helps others with the same problem. I have solved it by calling `file -i` via system; it works great and I have no problem with the XML parser. Thanks for your great help.
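For anyone who lands here later, calling `file -i` from Perl might look roughly like the sketch below. The sub names are hypothetical, and it assumes a `file` command whose `-i` option prints a line such as `name: text/plain; charset=utf-8` (some systems use a different flag for this, so check your man page).

```perl
use strict;
use warnings;

# Extract the charset from one line of `file -i` output, e.g.
# "data.xml: text/xml; charset=utf-8". Returns undef if none is found.
sub parse_charset {
    my ($line) = @_;
    return $line =~ /charset=([\w.-]+)/ ? lc($1) : undef;
}

# Run `file -i` on a path and return the charset it reports.
# The list form of open bypasses the shell, so spaces and shell
# metacharacters in the file name need no quoting.
sub file_charset {
    my ($path) = @_;
    open(my $pipe, '-|', 'file', '-i', $path)
        or die "cannot run file: $!";
    my $line = <$pipe>;
    close $pipe;
    return defined $line ? parse_charset($line) : undef;
}

# Usage (not run here):
#   my $charset = file_charset($filename);
#   recode($filename) unless defined $charset
#       && $charset =~ /^(?:utf-8|us-ascii)$/;   # recode() is hypothetical
```

Pure ASCII files are typically reported as `us-ascii`, which is why the usage comment treats that the same as `utf-8`. Keep in mind that `file` is guessing too, so this is a heuristic rather than a proof.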