in reply to What encoding am I (probably) using?
"I want to use Encode::from_to(...) to put everything into iso-8859-1 in (probable) good form."

No. If you're expecting to pull in data from various web sites that might use several different single-byte legacy encodings, most of them will not be directly mappable to iso-8859-1. The whole problem with the legacy single-byte encodings is that, to the extent they differ from one another, you cannot map from one to another without losing some characters.
Actually, to the extent that some 8-bit encodings cover fewer displayable characters than others (e.g. iso-8859-* never use 0x80-0x9f for displayable characters, whereas the Windows and Mac code pages always do), loss of information might only happen in one direction. But if your "from" encoding happens to be 8859-2 and your "to" encoding happens to be 8859-1, the conversion simply cannot work: 8859-2 contains letters such as ł and ő that 8859-1 has no code point for.
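To make the 8859-2 to 8859-1 case concrete, here is a minimal sketch using only decode and encode from the core Encode module. The sample bytes (the Polish word "łódź") and the variable names are my own choices for illustration, not anything from the original question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "łódź" as iso-8859-2 bytes: 0xB3 = ł, 0xF3 = ó, 0xBC = ź
my $latin2_bytes = "\xB3\xF3d\xBC";

# Decode the legacy bytes into Perl's internal Unicode characters.
my $chars = decode('iso-8859-2', $latin2_bytes);

# Default CHECK: characters missing from iso-8859-1 are silently
# replaced with a substitution character, so information is lost.
my $lossy = encode('iso-8859-1', $chars);

# Strict CHECK: the same conversion dies instead of losing data.
# LEAVE_SRC keeps $chars from being modified by the call.
my $strict = eval {
    encode('iso-8859-1', $chars, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $error = $@;

printf "lossy bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $lossy;
print "strict conversion failed: $error" unless defined $strict;
```

Only the ó and the d survive, because they are the only characters in the sample that both code pages share. Encode::from_to behaves like the lossy call when no CHECK argument is given, which is why it can look as though the conversion worked.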
So, always convert from some non-unicode encoding to utf8, which can represent every character of every legacy code page, so nothing is lost in that direction. As for guessing correctly from among several 8-bit code pages that cover different latin-alphabet-based languages, the sad truth remains that Encode::Guess will have a hard time getting it right, because nearly any byte sequence is valid in all of them. You need a certain amount of language modeling data (validated by manual inspection and labeling as to language and character set) and some simple statistics on your unknown input data in order to make a proper guess.
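A minimal sketch of the "always go to utf8" direction, assuming you already know (or have guessed) the source encoding. The helper name legacy_to_utf8 and the sample strings are hypothetical, made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: the caller supplies the source encoding name;
# nothing here tries to guess it.
sub legacy_to_utf8 {
    my ($bytes, $source_encoding) = @_;

    # Legacy bytes -> Perl characters. FB_CROAK makes malformed or
    # unmapped input fatal instead of silently substituting.
    my $chars = decode($source_encoding, $bytes,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());

    # Perl characters -> UTF-8 bytes. Every Unicode character has a
    # UTF-8 form, so this step cannot lose information.
    return encode('UTF-8', $chars);
}

my $utf8_polish = legacy_to_utf8("\xB3\xF3d\xBC", 'iso-8859-2');  # "łódź"
my $utf8_french = legacy_to_utf8("caf\xE9",       'iso-8859-1');  # "café"

# The same byte means different things under different labels, which is
# exactly why a wrong guess goes unnoticed: 0xB3 is ł in 8859-2, ³ in 8859-1.
my $as_latin2 = decode('iso-8859-2', "\xB3");
my $as_latin1 = decode('iso-8859-1', "\xB3");
printf "0xB3 decodes to U+%04X or U+%04X\n", ord $as_latin2, ord $as_latin1;
```

Note that the strict decode does not tell you whether the encoding label was right. With single-byte code pages the wrong label usually decodes without any error and just yields the wrong characters, so the guessing problem described above still has to be solved separately.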