Yes, I tried this code, but I don't get anything back. The problem is we have some special characters always in our data files. We already recode to utf8, but strangely, when the original file is already utf8 and we recode it anyway, we get garbled characters. I only need some code to detect if the file is already utf8, then we don't do the recode.
| [reply] |
->getresult is documented as:
"Returns the name of the detected charset or undef if no charset has (yet) been decided upon."
So if you "don't get anything back", this is most likely because the detector has not yet seen enough data to determine whether the input is utf8 or something else.
My advice is to look at your specific input data, and to remove the "special characters always in our data files", unless by these "special characters" you mean "text that we want to display". Once you have extracted the text in question, you can use some heuristics to find out if it looks like (valid) utf8, and then avoid double-encoding ("Mojibake").
Finding out whether some byte sequences are valid utf8 is most easily done by taking some strings that are not yet utf8-encoded, encoding them, and then dumping the bytes. If you then check your new data against those byte sequences, you can likely determine whether your input has already been encoded or not.
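As a rough sketch of that byte-dumping idea, the following uses only the core Encode module. The sample word and the hex_dump helper are made up for illustration; the point is only to show what single and double encoding look like at the byte level.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', unpack '(H2)*', $bytes;
}

my $text   = "caf\x{e9}";                     # "cafe" with e-acute, as a character string
my $latin1 = encode( 'ISO-8859-1', $text );   # one byte per character
my $utf8   = encode( 'UTF-8', $text );        # e-acute becomes two bytes

# Simulate the accidental second recode: the UTF-8 bytes are
# mistaken for Latin-1 and encoded to UTF-8 once more.
my $double = encode( 'UTF-8', decode( 'ISO-8859-1', $utf8 ) );

print "latin1: ", hex_dump($latin1), "\n";    # 63 61 66 e9
print "utf8:   ", hex_dump($utf8),   "\n";    # 63 61 66 c3 a9
print "double: ", hex_dump($double), "\n";    # 63 61 66 c3 83 c2 a9
```

If your data contains sequences like "c3 83 c2 a9" where you expect a single accented letter, it has most likely been encoded twice.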
| [reply] [d/l] |
I will try this tomorrow. Maybe it works. Too bad that it doesn't work when passing the whole file $file as an argument to the module.
| [reply] |
You said:
I only need some code to detect if the file is already utf8, then we don't do the recode.
The easiest way to check whether your data is utf8 is to read it as "raw" and try decoding it from utf8. If that succeeds, the data is clearly utf8. The reason why this is a good solution is that non-ASCII, non-utf8 data will virtually ALWAYS throw an error if you try to interpret it as utf8 data.
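use strict;
use warnings;
use Encode;

my $filename = shift or die "Usage: $0 filename\n";
open( my $fh, "<:raw", $filename ) or die "Cannot open $filename: $!";
my $data = do { local $/; <$fh> };    # slurp the whole file as raw bytes
close $fh;

# The strict 'UTF-8' encoding rejects malformed sequences that the lax
# 'utf8' would let through; FB_CROAK makes decode() die on the first one.
my $ok = eval { decode( 'UTF-8', $data, Encode::FB_CROAK ); 1 };
if ( $ok ) {
    print "$filename IS UTF8\n";
}
else {
    print "$filename is NOT UTF8\n";
}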
Note that when given an ASCII file, the above will say "$filename IS UTF8", which of course is true.
UPDATE: Just noticed a missing semicolon at the end of the eval block -- fixed it.
| [reply] [d/l] |
Hello graff,
I tried your great code, but I hit another bug. I'll now try calling file -i via system.
| [reply] |
No problem. I noticed it myself and fixed it in the script. Maybe your great script will help others with the same problem. I solved it by calling file -i via system; it works great and I have no more problems with the XML parser.
Thanks for your great help.
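For anyone who wants to try the same route, here is a minimal sketch of calling file -i from Perl. It assumes a GNU-style file command where -i prints the MIME type and charset and -b omits the filename; the function names are made up for illustration and are not the poster's actual script.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the charset out of one line of `file -i` output,
# e.g. "data.xml: application/xml; charset=utf-8".
sub charset_from_file_output {
    my ($line) = @_;
    return $line =~ /charset=([\w.-]+)/ ? lc($1) : undef;
}

# Hypothetical helper: run `file -bi` on a path without going through a shell.
sub detect_charset {
    my ($path) = @_;
    open( my $ph, '-|', 'file', '-bi', $path ) or die "Cannot run file: $!";
    my $line = <$ph>;
    close $ph;
    return defined $line ? charset_from_file_output($line) : undef;
}

# Plain ASCII is reported as us-ascii, which is already valid utf8.
sub needs_recode {
    my ($charset) = @_;
    return 1 unless defined $charset;
    return ( $charset eq 'utf-8' || $charset eq 'us-ascii' ) ? 0 : 1;
}

if (@ARGV) {
    my $charset = detect_charset( $ARGV[0] );
    print "$ARGV[0]: ", ( defined $charset ? $charset : 'unknown' ),
        needs_recode($charset) ? " (recode)\n" : " (leave as is)\n";
}
```

Keep in mind that file only guesses from the data it inspects, so the strict decode check shown earlier in the thread remains the more reliable test.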
| [reply] |