in reply to Re^2: Representing "binary" character in code?
in thread Representing "binary" character in code?

OK, I've had another stab at this.

I used wget to pull down the proxied page that shows the problem.

Looking at the html file with less, I see this where the "bad" char is:

| <a href="/2/about.html">About Us</a> | <a href="/2/service.html">We <U+0092>re About Service</a>
If I hexdump the file, the same fragment looks like this:

00002110  2f 32 2f 73 65 72 76 69 63 65 2e 68 74 6d 6c 22  |/2/service.html"|
00002120  3e 57 65 c2 92 72 65 20 0d 0a 20 20 20 20 20 20  |>We..re ..      |
00002130  20 20 41 62 6f 75 74 20 53 65 72 76 69 63 65 3c  |  About Service<|
Now, if I copy the <U+0092> binary character directly from my browser and paste it into the perl script, the search and replace works - it finds that character and I can replace it with the correct HTML entity - &#8217; in this case.

Here's some more code. This is how I am generating the rules, and using the utf8 character:

my @rules = (
    {
        From  => qq{<92>},
        To    => q{&#8217;},
        Flags => q{he},
    },
);
In the example above, <92> is a binary character that shows up in vim as "<92>". If I use less to display the file, it shows up as <U+0092>.

What I have so far failed to do is create that binary character programmatically, i.e. using \x{} escapes, pack(...), or any other technique.

It seems that if I use the utf8 character directly, perl does the right thing when I print it, but when I try to create the character indirectly I never quite get it right.

R.

--

Robin Bowes | http://robinbowes.com

Re^4: Representing "binary" character in code?
by graff (Chancellor) on Nov 06, 2006 at 02:27 UTC
    It sounds like the html data is already screwed up by the time you get it. U+0092 is a control character, not displayable. Meanwhile, a single-byte 0x92 is the cp1252 code point for "right single quotation mark".

    You must first undo the mistake that has already been made, to get the data back to its original, honest cp1252, then convert from that to utf8 the right way.

    The nature of the mistake is this: the original data (wherever it may be) started out as cp1252 with some miscellaneous characters in the 0x80-0xFF range. It then went through some process (probably a (mod_)perl operation) that mistakenly assumed it was iso-8859-1 and "promoted" those above-ascii characters to unicode by adding a null high byte (e.g. changing 0x92 to U+0092). Actually, this mistake only causes a problem for characters in the range 0x80-0x9F, which iso-8859-1 and unicode define as esoteric control characters, while cp1252 uses most of them for "smart punctuation" and a few miscellaneous "extra" accented characters; the two encodings are identical over the 0xA0-0xFF range.
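
    As a rough illustration of that suspected mistake (a sketch only; the actual process that mangled the page is unknown), the following script takes the single cp1252 byte 0x92 and converts it to utf8 twice: once assuming iso-8859-1, once assuming cp1252. The helper name hex_bytes is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One raw byte as it would sit in the original cp1252 data.
my $original_byte = "\x92";

# The suspected mistake: treat the byte as iso-8859-1, then write it as UTF-8.
my $wrong = encode('UTF-8', decode('iso-8859-1', $original_byte));

# The intended conversion: treat the byte as cp1252, then write it as UTF-8.
my $right = encode('UTF-8', decode('cp1252', $original_byte));

# Illustrative helper: show a byte string as space-separated hex.
sub hex_bytes { return join ' ', map { sprintf '%02x', ord } split //, shift }

print 'wrong: ', hex_bytes($wrong), "\n";   # c2 92
print 'right: ', hex_bytes($right), "\n";   # e2 80 99
```

    The "wrong" bytes match the c2 92 seen in the hexdump of the proxied page, which is what suggests this is the kind of conversion that went astray.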

    Anyway, getting the data back to "normal" is a little hard to grasp because of how perl handles codepoints and bytes in the 0x80-0xFF range -- I'm still learning the intricacies... Here are some one-liner commands to try things out:

    # first, let's emulate what is showing up in the html data:
    perl -e 'binmode STDOUT,":utf8"; print "\x92"' | od -txC
    0000000 c2 92
    0000002

    # now let's see how perl handles that as input:
    perl -e 'binmode STDOUT,":utf8"; print "\x92"' |
    perl -le 'binmode STDIN,":utf8"; $_=<STDIN>; print; binmode STDOUT,":utf8"; print' | od -txC
    0000000 92 0a c2 92 0a
    0000005

    # perl's internal representation for "unicode" U+0080-U+00FF
    # is really single bytes, and output to a non-utf8 file handle
    # will be single bytes; but the utf8 flag is set, and output
    # to a utf8 file handle will create "wide characters".

    # Now, to do what really needs to be done in your case:
    perl -e 'binmode STDOUT,":utf8"; print "\x92"' |
    perl -le 'use Encode; binmode STDIN,":utf8"; binmode STDOUT,":utf8"; $_=<STDIN>; print; $_=encode("iso-8859-1",$_); $_=decode("cp1252",$_); print' | od -txC
    0000000 c2 92 0a e2 80 99 0a
    0000007

    # the three byte sequence "e2 80 99" is utf8 for U+2019,
    # "right single quotation mark":
    perl -e 'binmode STDOUT,":utf8"; print "\x{2019}"' | od -txC
    0000000 e2 80 99
    0000003
    What happens in that third (longest) command line is that the script reads the data as utf8 (because that's what it really is), then turns it back (encodes it) into iso-8859-1 (because the process that is screwing things up assumed that encoding when it converted the original data to utf8); then, with the data back in its original single-byte encoding (which was really cp1252), it gets decoded again, using the appropriate code chart, into perl-internal utf8.
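
    The same round trip can be sketched as a standalone script rather than a pipeline. The sub name repair_cp1252 is hypothetical, and the sample string is simply the fragment from the hexdump with the line break left out. It assumes every character in the page fits in iso-8859-1 once decoded; anything above U+00FF would not survive the middle step.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 octets in which cp1252 bytes were
# wrongly promoted as iso-8859-1, and returns correctly converted UTF-8 octets.
sub repair_cp1252 {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);        # read what is really there
    my $bytes = encode('iso-8859-1', $chars);    # undo the wrong promotion
    my $fixed = decode('cp1252', $bytes);        # decode with the right chart
    return encode('UTF-8', $fixed);              # back to UTF-8 octets
}

my $broken   = "We\xc2\x92re About Service";
my $repaired = repair_cp1252($broken);

print $repaired eq "We\xe2\x80\x99re About Service" ? "repaired\n" : "unchanged\n";
```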

    Or, you could just replace things with ascii-range equivalents... the following should handle the most common code points, assuming that you have read the html data as utf8:

    tr/\x91-\x94\x96-\x98/''""--~/;
    But that's not a complete solution; you might hit some codes in the 0x80-0x9f range that don't have ascii equivalents. Using Encode covers everything.
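
    If the goal is still to emit HTML entities such as &#8217;, one possible follow-up (a sketch, not something tested against the real page) is to do the Encode repair first and then turn every remaining non-ascii character into a decimal numeric character reference. The sample string is again the fragment from the hexdump, without the line break.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Data as it arrives: UTF-8 octets containing the wrongly promoted U+0092.
my $octets = "We\xc2\x92re About Service";

# Repair the text as described above, leaving a string of characters.
my $text = decode('cp1252', encode('iso-8859-1', decode('UTF-8', $octets)));

# Replace each non-ASCII character with a decimal numeric character reference.
$text =~ s/([^\x00-\x7f])/sprintf('&#%d;', ord $1)/ge;

print "$text\n";   # We&#8217;re About Service
```

    The output is then pure ascii, so it no longer matters which encoding the downstream process assumes.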