Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need to process HTML files in a mix of UTF-8 and 2-byte encoding schemes, but the encoding information declared in a file may be inaccurate or missing, so I need to detect the encoding in Perl. Any suggestions?

Replies are listed 'Best First'.
Re: Encoding Detection in perl?
by moritz (Cardinal) on Feb 22, 2008 at 10:52 UTC
    You can use Encode::Guess for some heuristics, and Encode to verify your guess.
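
    A minimal sketch of that guess-then-verify idea, assuming the data is already in memory as raw bytes. The helper name guess_and_decode and the sample string are made up for illustration. Only the default suspects (ASCII, UTF-8, and UTF-16/32 with a BOM) are used here; extra suspects such as euc-jp or shiftjis can be passed as further arguments, but some byte sequences are valid in more than one of them, in which case guess_encoding returns an error string instead of an object.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Encode::Guess;

# Hypothetical helper: returns (encoding name, decoded text) or dies.
sub guess_and_decode {
    my ($bytes, @suspects) = @_;

    # guess_encoding returns an encoding object on success,
    # or a plain error string when it cannot decide.
    my $enc = guess_encoding($bytes, @suspects);
    die "Cannot guess encoding: $enc\n" unless ref $enc;

    # Verify the guess: a strict decode croaks on malformed input.
    # LEAVE_SRC keeps $bytes from being modified by the check.
    my $text = decode($enc->name, $bytes,
                      Encode::FB_CROAK() | Encode::LEAVE_SRC());
    return ($enc->name, $text);
}

# Illustrative data: three CJK characters encoded as UTF-8 bytes.
my $bytes = encode('UTF-8', "\x{65E5}\x{672C}\x{8A9E}");
my ($name, $text) = guess_and_decode($bytes);
print "guessed: $name\n";
```

    For a real file, read it with binmode on the filehandle first, so that the guesser sees the raw bytes rather than already-decoded text.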

    For some encodings that are commonly used in Asia there are special modules. search.cpan.org is your friend.

      I have tried Encode::Guess before, but UTF-8 seems to knock it down... any more suggestions?
        "knock it down" ! eq "The error message is (....)" nor "The output varies from what I expected; (samples)".

        "tried Encode::Guess...." ! eq (the code you wrote)

        A snippet of the code that failed (boil it down, check that that is indeed what leads to the failure) and the exact output will get you better answers.

        In my experience it works fine with UTF-8 (although it needs quite a bit of data to work reliably).

        Could you show us some example code, and the data that "knocked" it down? (perhaps post a hexdump of the data here, I don't think perlmonks is binary-safe ;-)
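
        One way to produce such a hexdump from Perl itself, as a rough sketch (the helper name hexdump and the sample bytes are only for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: render raw bytes as space-separated hex pairs.
sub hexdump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# Illustrative data: "cafe" with an accented e, as UTF-8 bytes.
my $sample = "caf\xC3\xA9";
print hexdump($sample), "\n";   # prints "63 61 66 c3 a9"
```

        The output is plain ASCII, so it survives being pasted into a post.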

        And if it dies for some data, maybe you should write a bug report.