in reply to Encoding problem

Could you provide a sample of the file (as seen by a hex/oct dumper, preferably)? It sounds like your file contains strings of text encoded using more than one encoding, with no indication of which encoding is used for which string.

If so, your file is messed up. Do you have the data needed to rebuild a sane file?

If not, it may still be possible to make a fairly accurate guess of the encoding used for a span of text given the information you gave. It would help if we saw a sample of this file.

Re^2: Encoding problem
by grscott (Novice) on May 08, 2009 at 18:44 UTC
    Agreed, the file IS messed up - not my idea, honest! :-) And I hope that I can get something less bizarre in the course of time, but that probably won't be for a while.

    Can't append a sample, sadly, as the file is at work, and I'm not.

    Basically, I have been working along the lines of trying to hit on a 'use open' line that would figure out the weird input format, and let me output to something more sensible; something like:

    use open "IN" => ":encoding(iso-8859-1):encoding(utf8)", "OUT" => ":encoding(utf8)";
    But, as I say, really not sure what I am doing with this - are they the right values? Are they in the right sequence? Do I need anything else?? Not a clue, quite frankly! Would be something to know that I am / am not on the right lines, at least.

    Cheers,

    GRS

      It depends on whether the data is double encoded, or whether different encodings are used for different parts of the file. Thus my request for a sample of the file. I suspect the latter.

      Using :encoding twice (assuming it works at all) would only help in the former case. The decoding would have to be done in the reverse of the order used for encoding.

      The latter case would involve looking at each byte or group of bytes and making guesses.
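
      As a rough sketch of that kind of guessing, the snippet below tries a strict UTF-8 decode on each chunk and falls back to ISO-8859-1 when that fails. The helper name decode_guess is made up for illustration, and it assumes the only two candidates are UTF-8 and ISO-8859-1 and that each chunk (say, one line) uses a single encoding. Neither assumption has been checked against the actual file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: guess the encoding of one chunk of bytes.
# Assumption: the chunk is either valid UTF-8 or ISO-8859-1.
sub decode_guess {
    my ($bytes) = @_;

    # Strict UTF-8 first. FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps $bytes intact so it can be reused for the fallback.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text if defined $text;

    # ISO-8859-1 maps every byte to a character, so this never fails.
    # That also means it cannot confirm the guess was right.
    return decode('ISO-8859-1', $bytes);
}
```

      Something like `$_ = decode_guess($_) for @lines;` would apply it line by line. Text that is really ISO-8859-1 but happens to form valid UTF-8 byte sequences would be misread, which is why this remains a guess.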

      PS: Don't use utf8 (a lax encoding known only to Perl) when decoding. That leaves you open to a vulnerability. Use UTF-8 instead.

      Update: Using :encoding twice doesn't always work, if it ever does. You'll need to use decode($enc1, decode($enc2, $_)) (decode comes from the Encode module) if your text is double-encoded.
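
      A minimal sketch of undoing double encoding, assuming the common case where UTF-8 bytes were misread as ISO-8859-1 and then encoded as UTF-8 again. The helper name fix_double_encoded and the sample string are illustrative only. The intermediate encode step is spelled out so that it croaks if the text turns out not to fit that assumption.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Build a double-encoded sample for demonstration:
# "cafe" with e-acute, encoded to UTF-8, misread as ISO-8859-1, encoded again.
my $original = "caf\x{E9}";
my $once     = encode('UTF-8', $original);
my $twice    = encode('UTF-8', decode('ISO-8859-1', $once));

# Hypothetical helper: peel the layers off in the reverse of the order
# in which they were applied.
sub fix_double_encoded {
    my ($bytes) = @_;
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

    # Outer layer: the second UTF-8 encoding.
    my $chars = decode('UTF-8', $bytes, $check);

    # Turn the mis-decoded characters back into the bytes they came from.
    # Croaks if any character is above U+00FF, i.e. the assumption is wrong.
    my $inner = encode('ISO-8859-1', $chars, $check);

    # Inner layer: the original UTF-8 encoding.
    return decode('UTF-8', $inner, $check);
}

my $fixed = fix_double_encoded($twice);
```

      If the misreading went through a different single-byte encoding (cp1252, for instance), the middle step would need that encoding's name instead.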

        Thank you for the prompt response, Ikegami. A couple of very helpful insights there, which I shall attempt to make use of as soon as possible.

        BTW, I am pretty confident that the entire file has been double encoded. I sure hope, anyway, that that is as bad as it gets... :-)