deepakdjadhav has asked for the wisdom of the Perl Monks concerning the following question:

I need to parse an RSS feed in a Perl script, but there is no encoding defined in the XML declaration, and the XML may contain non-ASCII characters (European/Japanese). Is there any way to determine the encoding of a complete XML document, or of individual strings/characters, in Perl? Thanks, DJ

Replies are listed 'Best First'.
Re: Encoding issue
by Corion (Patriarch) on Mar 09, 2007 at 17:12 UTC

    There is the Encode suite of modules. If you have an unknown encoding, Encode::Guess tries its best to guess the encoding.
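
    A minimal sketch of how Encode::Guess might be used here. The helper name guess_feed_encoding and the sample strings are made up for illustration. With no extra suspects, the module only considers ASCII, UTF-8 and BOM-marked UTF-16/32, so anything else has to be named explicitly, and even then the answer can be ambiguous.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: returns (encoding name, undef) on success,
# or (undef, error message) when the guess fails or is ambiguous.
sub guess_feed_encoding {
    my ($octets, @suspects) = @_;

    # guess_encoding() returns an encoding object on success
    # and a plain error string otherwise.
    my $guess = guess_encoding($octets, @suspects);
    return ref $guess ? ($guess->name, undef) : (undef, $guess);
}

# Valid UTF-8 octets: the default suspects are enough.
my ($name, $err) = guess_feed_encoding("caf\xC3\xA9 au lait\n");
print defined $name ? "guessed: $name\n" : "no guess: $err\n";

# A lone Latin-1 byte is neither ASCII nor valid UTF-8, so the
# default suspects all fail and we get an error message back.
($name, $err) = guess_feed_encoding("caf\xE9 au lait\n");
print defined $name ? "guessed: $name\n" : "no guess: $err\n";
```

    For Japanese feeds, extra suspects such as qw(euc-jp shiftjis 7bit-jis) can be passed as the second argument, at the price of more ambiguous results on short inputs.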

Re: Encoding issue
by Juerd (Abbot) on Mar 09, 2007 at 20:55 UTC

    XML without an explicit encoding declaration must be UTF-8, but depending on your interpretation of the spec, UTF-16 and UTF-32 are allowed if you use a BOM.

    If your encoding is not UTF-8|16|32, you're out of luck. There is NO reliable way to guess charset.
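
    The rule above could be checked roughly like this: look for a UTF-16/32 BOM first, and otherwise accept the data only if it decodes as strict UTF-8. This is a sketch, not a full implementation of the XML autodetection rules; the function name xml_default_encoding is an assumption of mine.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the encoding implied by a BOM, 'UTF-8'
# if the octets are valid UTF-8, or undef if neither applies.
sub xml_default_encoding {
    my ($octets) = @_;

    # Check the 4-byte BOMs first: the UTF-32LE BOM starts with
    # the same two bytes as the UTF-16LE BOM.
    return 'UTF-32BE' if $octets =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $octets =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-16BE' if $octets =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $octets =~ /\A\xFF\xFE/;

    # decode() with FB_CROAK dies on malformed input and may modify
    # its argument, so work on a copy.
    my $copy = $octets;
    return 'UTF-8'
        if eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };

    return undef;    # some other encoding: no reliable way to tell which
}

for my $sample ("<rss>caf\xC3\xA9</rss>", "\xFF\xFE<\x00", "<rss>caf\xE9</rss>") {
    my $enc = xml_default_encoding($sample);
    print defined $enc ? "$enc\n" : "unknown\n";
}
```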

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re: Encoding issue
by vrk (Chaplain) on Mar 09, 2007 at 18:47 UTC

    Guessing a character encoding accurately is pretty much impossible. It's easy to tell any of the /UTF-\d\d?/ encodings from ASCII, and still pretty easy to tell ASCII from an 8-bit character set, or an 8-bit character set from /UTF-\d\d?/. The problems begin when you have 8-bit encoded data and have to guess which of the dozens of 8-bit encodings it is.

    For example, the ISO-8859 series alone contains no fewer than fifteen different encodings. Then there are all the old IBM/Microsoft code pages (CP850, CP437; remember DOS?), the encodings used by Windows (such as CP-1252), and so on, not to mention Asian and East European encodings! In a word: good luck.
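
    To illustrate why 8-bit data is ambiguous, here is a small sketch that decodes the same single byte under three single-byte encodings. All three decodes succeed, and each yields a different character, so nothing in the data itself says which one was meant. The helper name decode_under is made up for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode the same octets under several candidate
# encodings and return a hash ref of encoding name => code points.
sub decode_under {
    my ($octets, @encodings) = @_;
    my %seen;
    for my $enc (@encodings) {
        my $copy = $octets;    # keep the original octets untouched
        my $text = Encode::decode($enc, $copy);
        $seen{$enc} = join ' ', map { sprintf 'U+%04X', ord } split //, $text;
    }
    return \%seen;
}

# Byte 0xE9: e-acute in Latin-1, a Cyrillic letter in ISO-8859-5,
# a Greek capital theta in the old DOS code page 437.
my $result = decode_under("\xE9", qw(iso-8859-1 iso-8859-5 cp437));
printf "%-11s %s\n", $_, $result->{$_} for sort keys %$result;
```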

    (I think Shift JIS is also easy to differentiate from ASCII and UTF, but I have no experience with it. It's 8-bit, but uses two bytes for most non-ASCII characters.)

    --
    print "Just Another Perl Adept\n";

Re: Encoding issue
by bart (Canon) on Mar 10, 2007 at 11:58 UTC
    "RSS feeds" implies HTTP to me. So, take a look at the HTTP headers when you're fetching the file.
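
    A rough sketch of what that could look like: pull the charset parameter out of a Content-Type header value. The function name charset_from_content_type and the sample header values are assumptions for illustration, and the server may of course send no charset at all, or a wrong one.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the charset parameter from a
# Content-Type header value, or return undef if there is none.
sub charset_from_content_type {
    my ($header) = @_;
    return undef unless defined $header;

    # Matches e.g. "text/xml; charset=ISO-8859-1" and the quoted
    # form charset="utf-8".
    return $header =~ /;\s*charset\s*=\s*"?([^\s;"]+)/i ? $1 : undef;
}

for my $value ('application/rss+xml; charset=ISO-8859-1', 'text/xml') {
    my $charset = charset_from_content_type($value);
    printf "%-42s => %s\n", $value, defined $charset ? $charset : '(none given)';
}
```

    When fetching with LWP::UserAgent, the header value would come from $response->header('Content-Type').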