in reply to Re^2: Why does Encode::Repair only correctly fix one of these two tandem characters?
in thread Why does Encode::Repair only correctly fix one of these two tandem characters?
No solution, just more on the “special.”
my $rdqm = "\N{RIGHT DOUBLE QUOTATION MARK}";
$rdqm = decode('Windows-1252', encode('UTF-8', $rdqm), Encode::FB_CROAK);
__END__
cp1252 "\x9D" does not map to Unicode at .../Encode.pm line 176.
I’m guessing you’ve got an irreversible mojibake situation that will require custom code or lookup tables. Don’t know if this particular case is already covered somewhere.
Replies are listed 'Best First'.
Re^4: Why does Encode::Repair only correctly fix one of these two tandem characters?
by Jim (Curate) on Aug 09, 2014 at 21:04 UTC
“I’m guessing you’ve got an irreversible mojibake situation that will require custom code or lookup tables.”

In the specific case of the corpus of documents I need to repair (which, by the way, is a very common case), all the mojibake are characters in the Windows-1252 character encoding in the range 0x80 through 0x9F. So I can repair the damaged characters with a small lookup table and a regular expression pattern that matches the substrings that are the damaged characters. Here's the script I'll use to repair the many text files with the UTF-8/Windows-1252 character encoding damage in them:
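(Jim's actual script isn't reproduced in this copy of the thread, but the approach he describes can be sketched. Everything below is my own illustration, not his code: the helper names are invented, and it assumes the program that caused the damage mapped the five bytes cp1252 leaves undefined — 0x81, 0x8D, 0x8F, 0x90, 0x9D — to the matching C1 control characters, as many Windows tools do.)

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Interpret a byte string as cp1252, passing the five undefined bytes
# through as the matching C1 controls (an assumption; see above).
sub as_cp1252 {
    my ($bytes) = @_;
    return join '', map {
        my $char = eval { decode('cp1252', chr($_), Encode::FB_CROAK) };
        defined $char ? $char : chr($_);
    } unpack 'C*', $bytes;
}

# Build the repair table: for each character in the cp1252 0x80-0x9F
# range, record the sequence it turns into when its UTF-8 encoding is
# misread as cp1252.
my %repair;
for my $byte (0x80 .. 0x9F) {
    my $char = eval { decode('cp1252', chr($byte), Encode::FB_CROAK) };
    next unless defined $char;    # skip the five unmapped bytes
    $repair{ as_cp1252(encode('UTF-8', $char)) } = $char;
}

# Longest alternatives first, in case one damaged sequence is a prefix
# of another.
my $pattern = join '|',
              map { quotemeta }
              sort { length $b <=> length $a } keys %repair;

sub repair_text {
    my ($text) = @_;
    $text =~ s/($pattern)/$repair{$1}/g;
    return $text;
}
```

For example, under these assumptions `repair_text()` turns the classic "Donâ€™t" back into "Don’t".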
(As always, constructive criticism and earnest suggestions for improvement are welcome and appreciated.)

“Don’t know if this particular case is already covered somewhere.”

I'm a little surprised by this blind spot in Encode::Repair because, in my experience, this is by far the most ubiquitous kind of mojibake in Latin-script text (i.e., text in Western European languages). In fairness to its author, moritz, the documentation includes the following Development section:

Development

The source code is stored in a public git repository at <http://github.com/moritz/Encode-Repair>. If you find any bugs, please use the issue tracker linked from this site.

If you find a case of messed-up encodings that can be repaired deterministically and that's not covered by this module, please contact the author, providing a hex dump of both input and output, and as much information about the encoding and decoding process as you have. Patches are also very welcome.
by ikegami (Patriarch) on Aug 11, 2014 at 02:02 UTC
It also turns up a problem: you can't tell the difference between the following cp1252 characters after they've gone through your encoding-and-decoding gauntlet:
Verification:
Note: I didn't have the tool check whether one messed-up sequence can be a substring of another messed-up sequence. The sorting by descending length is there to handle that case if it exists. Update: No such case exists.
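(ikegami's list of colliding characters isn't reproduced in this copy of the thread, but the effect is easy to demonstrate. The pair below is my own illustration, not taken from his output: U+2010 HYPHEN and U+201D RIGHT DOUBLE QUOTATION MARK encode to UTF-8 as E2 80 90 and E2 80 9D, and since 0x90 and 0x9D both have no cp1252 mapping, Encode's default fallback replaces each with U+FFFD.)

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Both trailing bytes (0x90 and 0x9D) are undefined in cp1252, so the
# default fallback substitutes U+FFFD for each, and the two distinct
# characters come out identical: "â€" followed by U+FFFD.
my $hyphen = decode('cp1252', encode('UTF-8', "\x{2010}"));
my $rdqm   = decode('cp1252', encode('UTF-8', "\x{201D}"));

print $hyphen eq $rdqm ? "indistinguishable\n" : "distinct\n";
```

Once the two inputs have collapsed to the same string, no lookup table can recover which one was originally there.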
by ikegami (Patriarch) on Aug 11, 2014 at 01:49 UTC
The most common garbage from Perl code is mixed UTF-8 and latin-1. It happens when you forget to specify the output encoding.
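(The code ikegami posted at this point isn't reproduced in this copy of the thread; the stand-in below is my own, but it has the same shape: two prints to a handle with no :encoding layer, one string of plain latin-1 code points and one containing a wide character. An in-memory file stands in for STDOUT.)

```perl
use strict;
use warnings;

my $out = '';
open my $fh, '>', \$out or die $!;

no warnings 'utf8';           # silence the "Wide character" warning
print {$fh} "caf\x{E9} ";     # only code points <= 0xFF: written as
                              # the single latin-1 byte 0xE9
print {$fh} "\x{20AC}5\n";    # contains a wide character: Perl falls
                              # back to writing the string as UTF-8
close $fh;

# $out now mixes latin-1 (0xE9) and UTF-8 (0xE2 0x82 0xAC) in one
# stream -- the garbage described above.
```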
The first string consists entirely of bytes, so Perl doesn't know you did something wrong. The second string makes no sense as bytes, so Perl guesses you meant to encode it using UTF-8. You end up with a mix of code points (effectively latin-1) and UTF-8. This is fixed using Encoding::FixLatin.
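(A minimal use of Encoding::FixLatin's fix_latin() on such a mixed string; the $mixed value is my own example, not from the thread.)

```perl
use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);

# A string mixing a raw latin-1 byte (0xE9, "é") with a UTF-8 sequence
# (0xE2 0x82 0xAC, "€") -- the kind of mixed output described above.
my $mixed = "caf\xE9 \xE2\x82\xAC5\n";

# fix_latin() keeps well-formed UTF-8 sequences as UTF-8 and treats
# stray high bytes as latin-1/cp1252, yielding one consistent Unicode
# string.
my $fixed = fix_latin($mixed);
```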