anlamarama has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a large number of text files that I am inserting into a database. The files are not all in the same encoding: some are UTF-8, some are iso-8859-9, and others are cp1254. I load each text file into a variable (currently the only way I can do it) and then update a row in the database. If the file is UTF-8, it should be inserted without changing the encoding. If it is cp1254 or iso-8859-9, I need to decode the data first. However, I have no idea which encoding a given file uses. Is there any way to determine the encoding? I will be updating 150,000 rows, so I would like to reduce potential errors as much as I can.

I tried Encode::Guess:

my $decoder = guess_encoding($data, qw/iso-8859-9 cp1254/);

It reports "iso-8859-9 or cp1254", but the correct encoding is cp1254, so that does not help either.
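
For context, a minimal sketch of how the result of guess_encoding can be inspected. As far as the Encode::Guess documentation describes it, the function returns an encoding object when exactly one candidate fits and a plain message string otherwise. The sample data in $data is a made-up assumption, not one of the real files.

```perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical sample: the single byte 0xFE is a valid letter in both
# iso-8859-9 and cp1254, so the two candidates cannot be told apart.
my $data = "merhaba \xFE";

my $decoder = guess_encoding($data, qw/iso-8859-9 cp1254/);

if (ref $decoder) {
    # Exactly one candidate matched: we got an encoding object back.
    print "Unambiguous guess: ", $decoder->name, "\n";
}
else {
    # No match, or more than one: we got a message string instead.
    print "No single guess: $decoder\n";
}
```

Checking ref() first avoids calling ->decode on what is really just a message string.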

What do you suggest? Are there any workarounds or solutions for this?

Thanks in advance,

Replies are listed 'Best First'.
Re: Encoding Problem
by graff (Chancellor) on Nov 13, 2009 at 02:59 UTC
    Check the respective code page listings, which you can find here: http://www.unicode.org/Public/MAPPINGS/ISO8859/ and here: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/. It turns out that cp1254 and 8859-9 are the same set of characters -- the only difference is that all the cp12* pages cram stuff into the 0x80-0x9f range, where the 8859-* pages just have "control characters" (effectively nothing useful).

    So if you use cp1254 for everything that isn't utf8, you should be fine -- the 8859-9 data will be using a subset of the characters defined by the cp1254 table. (And it's easy to tell whether something is utf8 or not: try to decode it as if it were utf8, and if that fails, you know it isn't utf8.)
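
    The approach above could be sketched roughly as follows. This is only an illustration, not tested against the real files; decode_text is a hypothetical helper name. It tries a strict UTF-8 decode first and falls back to cp1254 when that fails.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: turn raw file bytes into a Perl character string.
sub decode_text {
    my ($bytes) = @_;

    # Strict UTF-8 decode; FB_CROAK makes malformed input die instead of
    # being silently replaced, and LEAVE_SRC keeps $bytes untouched.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;

    # Not valid UTF-8: treat it as cp1254, which also covers iso-8859-9 text.
    return decode('cp1254', $bytes);
}
```

    Pure ASCII input passes the UTF-8 test, which is harmless because ASCII decodes to the same characters either way.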

      Thanks, I did not know that.

      However, when I tried to decode cp1254-encoded data as iso-8859-9, it gave me garbled text. I have tried it again and it works; you are right. I guess I mixed up encodings or files while testing (lots of files, encodings, etc.). Sorry about that.

      So I guess I should not have blamed Encode::Guess either. :)

      Thanks again,

        I tried to decode cp1254 encoded data with iso-8859-9, it gave me garbled text.

        If you had been asking for errors or warnings from Encode, it would have given you those as well.

        Make sure you understand the "superset/subset" relation: cp1254 is a superset of 8859-9 (8859-9 is a subset of cp1254), which means that treating cp1254 data as if it were 8859-9 data is likely to fail, whereas treating 8859-9 data as if it were cp1254 will not fail.
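
        One way to see the superset relation, sketched here in the encoding direction and not taken from the original discussion: the euro sign exists in cp1254 (at 0x80) but has no slot in iso-8859-9, so asking Encode to croak on unmappable characters makes the difference visible.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

my $euro = "\x{20AC}";    # EURO SIGN: byte 0x80 in cp1254, absent from iso-8859-9

# cp1254 can represent the character, so this succeeds.
my $as_cp1254 = encode('cp1254', $euro, FB_CROAK | LEAVE_SRC);

# iso-8859-9 cannot, so with FB_CROAK the call dies; eval catches that.
my $as_8859_9 = eval { encode('iso-8859-9', $euro, FB_CROAK | LEAVE_SRC) };
my $error     = $@;

printf "cp1254:     0x%02X\n", ord $as_cp1254;
print  "iso-8859-9: ", (defined $as_8859_9 ? "ok\n" : "failed: $error");
```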

        And yes, Encode::Guess was apparently doing the right thing and giving you the correct answer, if the text you gave it happened to actually be 8859-9 (because such text could also be cp1254). But if you gave it single-byte-per-character text that included a lot of bytes in the 0x80-0x9f range, and it said "this could be 8859-9", I would call that a disappointing mistake.
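
        A rough way to check for that case yourself, assuming the data has already been ruled out as UTF-8 (UTF-8 continuation bytes also fall in this range); has_c1_range_bytes is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the byte string contains any byte in
# 0x80-0x9F, the range where cp1254 has printable characters and
# iso-8859-9 has only control codes.
sub has_c1_range_bytes {
    my ($bytes) = @_;
    return $bytes =~ /[\x80-\x9F]/ ? 1 : 0;
}
```

        A true result on single-byte text is a strong hint that the data is cp1254 and not plain 8859-9.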

Re: Encoding Problem
by anlamarama (Acolyte) on Nov 13, 2009 at 05:58 UTC

    I guess I found what I was doing wrong (why I got garbled text when decoding cp1254-encoded data as iso-8859-9).

    It could be useful for other people in the future (it's a silly mistake, though).

    I was opening the file with this argument:

    open FILE, "+<file.txt";

    I load the file content into a variable, as I said before (I need to do that in order to update the row). After decoding the data, I was writing the decoded data back into the same file, so the encoding of that file effectively stayed the same. We need to delete the file and create a new one. That was the problem.

    (Or we could use something like iconv, but I guess it won't make any difference, since we still need to load the data into a variable to determine its encoding and update the database.)
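
    A minimal sketch of the "create a new file" approach, assuming the source encoding is already known; rewrite_as_utf8 and the ".tmp" suffix are hypothetical choices made for illustration. It reads the raw bytes, decodes them, writes UTF-8 to a temporary file, and then replaces the original.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: rewrite $path as UTF-8, assuming its bytes are in $from.
sub rewrite_as_utf8 {
    my ($path, $from) = @_;

    # Read the raw bytes; no decoding layer on the input handle.
    open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;

    my $text = decode($from, $bytes);

    # Write to a separate file so the old bytes are never mixed with the new.
    my $tmp = "$path.tmp";
    open(my $out, '>:encoding(UTF-8)', $tmp) or die "Cannot write $tmp: $!";
    print {$out} $text;
    close $out or die "Cannot close $tmp: $!";

    rename($tmp, $path) or die "Cannot rename $tmp to $path: $!";
    return $text;    # the decoded string, e.g. for the database update
}
```

    The returned string can be used for the row update, so the file only has to be read once.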

    Thanks again,