I have some text that was damaged by a character encoding error. It's gibberish, not Chinese.
敒›剕䕇呎
U+6552 CJK UNIFIED IDEOGRAPH-6552
U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
U+5255 CJK UNIFIED IDEOGRAPH-5255
U+4547 CJK UNIFIED IDEOGRAPH-4547
U+544E CJK UNIFIED IDEOGRAPH-544E
I know this is the original, undamaged text.
Re: URGENT
I've determined how the damage occurred. The original ten characters were ASCII (and therefore also valid UTF-8), but their ten bytes were mistakenly interpreted as five UCS-2LE code units. Those were then petrified as five bogus characters (mojibake) stored as UTF-8. This is essentially what happens in the infamous "Bush hid the facts" bug in Microsoft Notepad.
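As a sanity check on that diagnosis, here is a minimal sketch that reproduces the damage. It assumes the core Encode module and uses 'UTF-16LE' as the encoding name, which is identical to UCS-2LE for these code points; the variable names are only illustrative.

```perl
use strict;
use warnings;
use utf8;                      # this source contains literal non-ASCII text
use Encode qw(decode);

# The ten original ASCII bytes.
my $original_bytes = 'Re: URGENT';

# Misinterpret those ten bytes as five 16-bit little-endian code units.
my $misread = decode('UTF-16LE', $original_bytes);

# Compare against the damaged text from the post.
my $matches = $misread eq '敒›剕䕇呎' ? 'yes' : 'no';
print "Reproduces the damaged text: $matches\n";
```

If the diagnosis is right, the comparison should succeed, which also suggests the repair is simply the same step run in the opposite direction.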
Here's the pattern.
R e : U R G E N T
52 65 3A 20 55 52 47 45 4E 54
U+6552 U+203A U+5255 U+4547 U+544E
敒 › 剕 䕇 呎
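The pairing in that table can be sketched with unpack. The 'v' template reads an unsigned 16-bit little-endian value, so the second byte of each pair becomes the high byte of the code point. The @rows array and the output layout are my own choices for illustration.

```perl
use strict;
use warnings;

my $original = 'Re: URGENT';

# '(a2)*' splits the string into two-byte chunks: 'Re', ': ', 'UR', ...
my @rows;
for my $pair (unpack '(a2)*', $original) {
    # The two bytes of this pair, as hex.
    my $bytes = join ' ', map { sprintf '%02X', ord($_) } split //, $pair;
    # 'v' reads the pair as one 16-bit value, low byte first.
    my $code_point = sprintf 'U+%04X', unpack 'v', $pair;
    push @rows, "$bytes -> $code_point";
}
print "$_\n" for @rows;
# 52 65 -> U+6552
# 3A 20 -> U+203A
# 55 52 -> U+5255
# 47 45 -> U+4547
# 4E 54 -> U+544E
```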
How can I reverse this character encoding damage using Perl? I tried using Encode::Repair, but I couldn't get it to work. It seems to me this repair job should be easily accomplished using pack/unpack, but those two functions have always confounded me. I need guidance.
UPDATE: Here's what I've managed to cobble together. It works, but I'm not impressed. Surely there's a better way.
use v5.16;
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';
my $damaged_text = '敒›剕䕇呎';
my $repaired_text = '';
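while ($damaged_text =~ m/(\X)/g) {
    # Split the code point's four hex digits into its high and low bytes.
    my ($msb, $lsb) = unpack 'A2A2', sprintf "%04x", ord $1;
    # UCS-2LE stores the low byte first, so emit it first.
    $repaired_text .= chr(hex $lsb) . chr(hex $msb);
}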
say $repaired_text; # Prints 'Re: URGENT'
(I had to use <pre> tags instead of <code> tags here because of the Chinese characters in the script.)
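For comparison, here is a shorter sketch of the same repair, under the assumption that the damage really is "bytes misread as UCS-2LE". The helper name repair_utf16le_mojibake is hypothetical, not something Encode provides. It re-encodes the damaged characters as UTF-16LE to recover the original bytes, then decodes those bytes as UTF-8. The pack line does the same thing by hand and only holds for code points below U+10000.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: undo "bytes misread as UTF-16LE/UCS-2LE".
sub repair_utf16le_mojibake {
    my ($damaged) = @_;
    # Writing the bogus characters back out as UTF-16LE recovers the bytes.
    my $bytes = encode('UTF-16LE', $damaged);
    # Those bytes were ASCII all along, which is also valid UTF-8.
    return decode('UTF-8', $bytes);
}

my $damaged    = '敒›剕䕇呎';
my $via_encode = repair_utf16le_mojibake($damaged);

# Same idea with pack: 'v*' writes each code point as 16 bits, low byte first.
my $via_pack = pack 'v*', map { ord($_) } split //, $damaged;

print "$via_encode\n";   # Re: URGENT
print "$via_pack\n";     # Re: URGENT
```

One caveat with either approach: if the original text had an odd number of bytes, the final byte may not have survived the misreading, so the repair cannot bring it back.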
In reply to How to Fix Character Encoding Damaged Text Using Perl? by Jim