The first thing I tried was flattening out those HTML escapes: s/&(\d+);/chr $1/ge. I then realized that printing the result to the WinME console wouldn't do much good, so I put it into an HTML file, viewed it with Mozilla, and set the character encoding to UTF-8. I got siaohtal.om@blahblah.co.jp. If you can view that, you could have read the original, so that approach isn't much good on its own, but it gives us a starting point.
From there, I looked up what unicode.org has to say about the characters. The code charts are organized by hex code, so convert one: the first, 65363, is U+FF53. So I went from there to Code Charts, and found that the greatest starting point less than FF53 was FF00, Halfwidth and Fullwidth Forms, and that FF53 is "FULLWIDTH LATIN SMALL LETTER S", which maps to U+0073, "s". In fact, FF01-FF5E are annotated "see ASCII 0020-007E", and they all map out nicely: subtract 0xFF00 and add 0x20.
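The lookup above can be double-checked from Perl itself. This is only a minimal sketch: the variable names are mine, and it assumes a Perl recent enough to ship charnames::viacode.

```perl
use strict;
use warnings;
use charnames ();

my $decimal = 65363;    # first entity in the address

# Hex form, as used by the unicode.org code charts.
my $hex = sprintf "U+%04X", $decimal;

# Official character name, looked up from the code point.
my $name = charnames::viacode($decimal);

# Fullwidth forms sit at a fixed offset from their ASCII counterparts.
my $ascii = chr($decimal - 0xFF00 + 0x20);

print "$hex $name -> $ascii\n";
```

Run against the other entities, the same arithmetic gives the rest of the address.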
Thus, we can take each of the high characters and map it to its non-wide low equivalent. The final code is:
$_='&65363;&65353;&65345;&65359;&65352;&65364;&65345;&65356;.&65359;m@blahblah.co.jp';
s/&(\d+);/chr($1-0xFF00+0x20)/ge;
print $_;
Oh, but we're forgetting something: fullwidth characters aren't the only case where you might get &\d+; things, so convert only the code points that fall in the fullwidth range:
$_='&65363;&65353;&65345;&65359;&65352;&65364;&65345;&65356;.&65359;m@blahblah.co.jp';
# Convert FF01-FF5E to ASCII; put any other entity back untouched.
s/&(\d+);/($1>0xFF00 && $1<0xFF5F) ? chr($1-0xFF00+0x20) : "&$1;"/ge;
print $_;
Update: This has a bug: it eats some of the ASCII at the end of your input string. I'll fix it after dinner.
Found it. It's the oldest mistake in the book: I used double quotes instead of single quotes around an email address, so Perl tried to interpolate @blahblah as an array. Had I run this with warnings, I would have gotten a nice warning telling me exactly what I was doing wrong. Fixed in the above.
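For reuse, the range-checked substitution can be wrapped in a subroutine. This is a rough sketch rather than anything definitive: the name narrow_entities is hypothetical, and passing unrelated entities through unchanged (as "&$1;") is my assumption about the desired behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: convert numeric entities for the fullwidth ASCII
# variants (U+FF01..U+FF5E) to plain ASCII and leave other entities alone.
sub narrow_entities {
    my ($text) = @_;
    $text =~ s{&(\d+);}{
        ($1 >= 0xFF01 && $1 <= 0xFF5E)
            ? chr($1 - 0xFF00 + 0x20)    # fixed offset down to ASCII
            : "&$1;"                     # not fullwidth: keep the entity
    }ge;
    return $text;
}

# Single quotes, so @blahblah is not interpolated as an array.
my $address = '&65363;&65353;&65345;&65359;&65352;&65364;&65345;&65356;.&65359;m@blahblah.co.jp';
print narrow_entities($address), "\n";
```

With strict and warnings enabled, the double-quoted version of that string is rejected at compile time instead of silently losing part of the address.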
Confession: It does an Immortal Body good.