dominic01 has asked for the wisdom of the Perl Monks concerning the following question:

I have a huge log file which claims to be UTF-8 encoded. (When I open the file in Notepad++, the bottom right corner shows the encoding.)

However, some of the lines contain non-English characters. Some of these display correctly, and some appear to have been encoded into some other format.

For example, one line has "Köhler" and another line has "K+¦hler". These logs are written by different systems and I do not know what kind of encoding this is. How do I decode "K+¦hler" into the original string "Köhler"? I appreciate any help in this regard.

Replies are listed 'Best First'.
Re: Another Encoding decode query
by soundX (Acolyte) on Feb 08, 2015 at 08:25 UTC
    I'm relatively new to Perl, so one of the monks may have a better idea, but have you tried using Encoding::FixLatin? I've recently used it to fix similar encoding issues.
Re: Another Encoding decode query
by ikegami (Patriarch) on Feb 08, 2015 at 15:59 UTC
    "+¦" is supposed to represent "ö"? So at least 10 bytes? No single encoding would produce that. Even in UTF-8, ö only takes two bytes. Repeatedly encoding using UTF-8 doesn't result in anything like what you have either. (There would be repetition.)
      Between + and | there were actually 4 characters. When I pasted the chars in perlmonks, it further added 4 more characters. Hence you noted 10 characters in total.
Re: Another Encoding decode query
by i5513 (Pilgrim) on Feb 08, 2015 at 10:56 UTC

    Hello

    Regarding your question:

    "These logs are written by different systems and I do not know what kind of encoding this is."

    How did you get such a file? It looks double encoded.

    You must ensure that every system which writes to this file uses the same encoding (maybe UTF-8 is fine in your case).

    Please tell us if you get the work done. I think that a double-encoded file is very difficult to decode correctly without doing manual conversions.

    I would start by filtering for lines which contain letters you do not expect (add all the letters that you would expect in your file): chomp; print if /[^a-zA-Z0-9ÄÖÜäöüß,_.\s-]/; and then start changing specific words with the s command.

    It is always interesting to know how such a file comes to life! I have had some of them because of a botched substitution (wrong iconv calls).

    Regards
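
The filtering step described in this reply can be sketched as follows. This is a minimal sketch, not i5513's actual script: the sub name `suspicious_lines` is hypothetical, the set of expected characters is an assumption (German text plus basic punctuation), and the input lines are assumed to be raw UTF-8 bytes. The hyphen sits last in the character class so that it is a literal hyphen rather than a range operator.

```perl
use strict;
use warnings;
use utf8;                        # the regex below contains literal non-ASCII letters
use Encode qw(decode);

# Assumed whitelist: ASCII letters and digits, German letters, a few punctuation marks.
my $unexpected = qr/[^a-zA-Z0-9ÄÖÜäöüß,_.\s-]/;

# Hypothetical helper: return the decoded lines that contain an unexpected character.
sub suspicious_lines {
    my (@raw_lines) = @_;
    my @hits;
    for my $raw (@raw_lines) {
        my $line = decode('UTF-8', $raw);   # malformed bytes become U+FFFD
        chomp $line;
        push @hits, $line if $line =~ $unexpected;
    }
    return @hits;
}

# Two sample lines built from raw bytes: a clean "Köhler" and a damaged variant.
my @sample = ("K\xc3\xb6hler\n", "K+\xc3\x82\xc2\xa6hler\n");
printf "%d of %d lines look suspicious\n", scalar(suspicious_lines(@sample)), scalar(@sample);
```

Names such as "José" would also be flagged with this whitelist, so the character class needs to be extended to cover every language that legitimately appears in the log.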

Re: Another Encoding decode query
by Anonymous Monk on Feb 08, 2015 at 09:12 UTC

    It's very, very easy: get the raw binary data and dump it with Data::Dump::dd.

    Then encode "Köhler" with every available encoding from Encode->encodings(q{:all}) and dump each result the same way.

    When you find two dumps that match, you've found your encoding, you're an Encode::Detectiv.....
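
The brute-force comparison suggested above might look like this. It is a sketch under a few assumptions: the sub name `matching_encodings` is hypothetical, and it compares hex strings from the core `unpack 'H*'` instead of Data::Dump output so that only core modules are needed. The hex string is the clean "Köhler" sample posted later in this thread.

```perl
use strict;
use warnings;
use utf8;                        # this source contains a literal "ö"
use Encode qw(encode);

# Hypothetical helper: names of all encodings that turn $text into exactly
# the bytes given by $hex (lowercase hex digits, no spaces).
sub matching_encodings {
    my ($text, $hex) = @_;
    my @matches;
    for my $name (Encode->encodings(':all')) {
        my $copy  = $text;                              # keep the original untouched
        my $bytes = eval { encode($name, $copy) };      # skip encodings that fail
        next unless defined $bytes;
        push @matches, $name if unpack('H*', $bytes) eq $hex;
    }
    return @matches;
}

# Hex of the good sample from the log: "4b c3 b6 68 6c 65 72"
print join(', ', matching_encodings('Köhler', '4bc3b6686c6572')), "\n";
```

Running the same sub against the hex of the damaged line is a quick way to check whether any single encoding can explain it.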

Re: Another Encoding decode query
by pme (Monsignor) on Feb 08, 2015 at 10:08 UTC
    What is your font setting in Notepad++? Have you tried 'Arial Unicode MS'?
Re: Another Encoding decode query
by Anonymous Monk on Feb 08, 2015 at 09:15 UTC
    (shrugs) None of the encoders known to Perl can decode it to anything meaningful. This thing was probably double encoded (perhaps by Notepad++). What does it look like in binary?
Re: Another Encoding decode query
by dominic01 (Sexton) on Feb 10, 2015 at 03:08 UTC
    I agree with many of you. The file is probably double encoded. From my analysis, I also noticed that the "cat" or "more" redirection might have changed the encoding further. I am trying different options and I will come back with my findings. Note: when I typed the encoded form of the character "ö", the Perlmonks interface added 4 more characters, hence it looks odd in my original post.
Re: Another Encoding decode query
by dominic01 (Sexton) on Feb 10, 2015 at 05:12 UTC
    Here is the Hex view "Köhler": "4b c3 b6 68 6c 65 72" "K+¦hler": "4b 2b c3 83 c2 83 c3 a2 c2 80 c2 9a c3 83 c2 82 c3 82 c2 a6 68 6c 65 72" In another case I noted this. How to decode the following to "é". José:c3 83 c2 83 c3 82 c2 a9 José:c3 a9
      Here it is with proper formatting:

      "Köhler": "4b c3 b6 68 6c 65 72"
      "K+¦hler": "4b 2b c3 83 c2 83 c3 a2 c2 80 c2 9a c3 83 c2 82 c3 82 c2 a6 68 6c 65 72"

      José:c3 83 c2 83 c3 82 c2 a9
      José:c3 a9
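
For the "José" bytes above, one possible recovery is to repeatedly undo "UTF-8 bytes misread as Latin-1 and encoded to UTF-8 again". The following is a sketch under that assumption; the sub name `undo_multi_encoding` is hypothetical and only the core Encode module is used. It decodes strictly, and stops as soon as the characters no longer fit in single bytes or the bytes are no longer valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: returns the recovered character string and the number of
# extra encoding layers removed, or an empty list if the input is not valid UTF-8.
sub undo_multi_encoding {
    my ($bytes) = @_;
    my $strict = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $chars  = eval { decode('UTF-8', $bytes, $strict) };
    return unless defined $chars;
    my $rounds = 0;
    while (1) {
        last if $chars =~ /[^\x00-\xFF]/;            # cannot be Latin-1 bytes
        my $as_bytes = encode('ISO-8859-1', $chars);
        my $next = eval { decode('UTF-8', $as_bytes, $strict) };
        last unless defined $next;                   # not UTF-8 underneath: done
        last if $next eq $chars;                     # pure ASCII: nothing to peel
        ($chars, $rounds) = ($next, $rounds + 1);
    }
    return ($chars, $rounds);
}

# The damaged "é" from the hex dump: c3 83 c2 83 c3 82 c2 a9
my ($chars, $rounds) = undo_multi_encoding(pack 'H*', 'c383c283c382c2a9');
printf "U+%04X after %d extra round(s)\n", ord($chars), $rounds;
```

For these bytes the loop removes two extra layers and ends at U+00E9 (é). The "K+¦hler" bytes do not unwind the same way: after one round they contain U+201A, which is not a Latin-1 character, and the hex shows a literal "+" (2b) where the first byte of "ö" would be, so that line apparently went through a different and lossy conversion.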