
It sounds like the HTML data is already mangled by the time you get it. U+0092 is a control character, not a displayable one. Meanwhile, the single byte 0x92 is the cp1252 encoding of "right single quotation mark".

You must first undo the mistake that has already been made, to get the data back to its original, honest cp1252, then convert from that to utf8 the right way.

The nature of the mistake is this: the original data (wherever it may be) started out as cp1252 with some miscellaneous characters in the 0x80-0xFF range. Then it went through some process (probably a (mod_)perl operation) that mistakenly assumed it was iso-8859-1, and this process "promoted" those above-ascii characters to unicode by adding a null high-byte (e.g. changing 0x92 to U+0092). Actually, this mistake only causes a problem for characters in the range 0x80-0x9F, which iso-8859-1 and unicode define as esoteric control characters, while cp1252 uses most of them for "smart punctuation" and a few miscellaneous "extra" accented characters; the two encodings are identical over the 0xA0-0xFF range.

Anyway, getting the data back to "normal" is a little hard to grasp because of how perl handles codepoints and bytes in the 0x80-0xFF range -- I'm still learning the intricacies... Here are some one-liner commands to try things out:

# first, let's emulate what is showing up in the html data:

perl -e 'binmode STDOUT,":utf8"; print "\x92"' | od -txC
0000000    c2  92
0000002

# now let's see how perl handles that as input:

perl -e 'binmode STDOUT,":utf8"; print "\x92"' |
 perl -le 'binmode STDIN,":utf8"; $_=<STDIN>; print;
          binmode STDOUT,":utf8"; print' | od -txC
0000000    92  0a  c2  92  0a
0000005

# a string read through a :utf8 layer holds characters, not bytes;
# characters in U+0080-U+00FF written to a plain (non-utf8) file
# handle come out as single bytes, while the same characters
# written to a :utf8 file handle come out as two-byte utf8 sequences.

# Now, to do what really needs to be done in your case:

perl -e 'binmode STDOUT,":utf8"; print "\x92"' |
 perl -le 'use Encode; binmode STDIN,":utf8"; binmode STDOUT,":utf8";
  $_=<STDIN>; print;
  $_=encode("iso-8859-1",$_);
  $_=decode("cp1252",$_); print' | od -txC
0000000    c2  92  0a  e2  80  99  0a
0000007

# the three byte sequence "e2 80 99" is utf8 for U+2019,
# "right single quotation mark":

perl -e 'binmode STDOUT,":utf8"; print "\x{2019}"' | od -txC
0000000    e2  80  99
0000003

What happens in that third (longest) command line is this: the script reads the data as utf8 (because that's what it really is), then turns it back (encodes it) into iso-8859-1 (because the process that is screwing things up assumed that encoding when it converted the original data to utf8). Then, with the data back in its original single-byte encoding (which was really cp1252), it gets decoded again, using the appropriate code chart, into perl-internal utf8.

Or, you could just replace things with ascii-range equivalents... the following should handle the most common code points, assuming that you have read the html data as utf8:

tr/\x91-\x94\x96-\x98/''""\-\-~/;

But that's not a complete solution; you might hit some codes in the 0x80-0x9f range that don't have ascii equivalents. Using Encode covers everything.

In reply to Re^4: Representing "binary" character in code? by graff
in thread Representing "binary" character in code? by robinbowes
