Re^2: UTF8 Validity

by menolly (Hermit)
on Feb 22, 2008 at 00:47 UTC ( [id://669440] )

in reply to Re: UTF8 Validity
in thread UTF8 Validity

Thanks; that's the kind of pointer I need. Most of my non-ASCII/non-UTF8 data is either in contact data or easily connected to contact data, so I've been trying to guess the charset based on the geographic origin, with mixed results. I definitely have multiple encodings present -- so far, there's cp1251 (Cyrillic), latin1, some form of Japanese, and something I can't identify but have scrubbed out in the source DB.
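The approach described above, checking whether the bytes are already valid UTF-8 and otherwise falling back to a guess based on where the contact is located, could be sketched roughly as follows. This is a Python sketch rather than Perl, and the `REGION_HINTS` table, its country codes, and the function names are hypothetical; the real mapping would have to come from the data itself.

```python
# Hypothetical region-to-charset hints; the real table depends on the data.
REGION_HINTS = {
    "RU": "cp1251",
    "JP": "shift_jis",
    "DE": "latin-1",
}


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (pure ASCII also passes)."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def decode_contact_field(raw: bytes, country: str):
    """Return (text, encoding_used), or (None, None) if no guess applies."""
    if is_valid_utf8(raw):
        return raw.decode("utf-8"), "utf-8"
    hint = REGION_HINTS.get(country)
    if hint is None:
        return None, None
    try:
        return raw.decode(hint), hint
    except UnicodeDecodeError:
        # The regional guess did not fit these bytes; leave for manual review.
        return None, None
```

A successful decode under the regional hint only shows that the bytes are legal in that charset, not that the guess is right, which would be consistent with the mixed results mentioned above.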

Replies are listed 'Best First'.
Re^3: UTF8 Validity
by graff (Chancellor) on Feb 22, 2008 at 02:18 UTC
    Encode::Guess is likely to be helpful for figuring out the source encodings for many of the Asian (multi-byte-char) strings, though it might not help much for distinguishing among single-byte encodings. Worth a try.
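To illustrate why a suspects-list approach tends to work better for multi-byte encodings than for single-byte ones, here is a minimal sketch in Python (not Perl, and not how Encode::Guess is implemented internally). It simply tries a strict decode under each candidate; the function name and the sample strings are assumptions made for the example.

```python
def plausible_encodings(raw: bytes, suspects):
    """Return the suspects under which raw decodes without error."""
    matches = []
    for name in suspects:
        try:
            raw.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


suspects = ["utf-8", "euc_jp", "cp1251", "latin-1"]

# Multi-byte encodings have structural rules, so wrong guesses often fail.
japanese = "日本語".encode("euc_jp")
print(plausible_encodings(japanese, suspects))

# Single-byte encodings accept nearly any byte, so several suspects survive.
cyrillic = "Привет".encode("cp1251")
print(plausible_encodings(cyrillic, suspects))
```

Both cp1251 and latin-1 accept the Cyrillic sample, so a strict decode alone cannot choose between them; telling single-byte encodings apart needs extra evidence such as character frequencies or the region of the contact.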

      Encode::Guess is lame because the user has to tell it up front which encodings the data might be in.

      Use Encode::Detect instead. This is the same detector used in Mozilla browsers.

        I've been using Encode::Guess, but have had trouble building a suspects list for some data. However, Firefox hasn't been able to appropriately handle the problem data, either, so if Encode::Detect is the same method, I doubt it would've done any better on this data.