I think this is rephrasing GrandFather's question, but... What do you want as a replacement for the "bad" characters?

When you say you "hex-dumped the source file" (and saw "c2 9f"), were you talking about the original html file (and "c2 9f" was/were the "bad" characters)? Or were you talking about your perl script? If you were talking about your perl script (which is what I'm guessing), then what do the "bad" characters in the html file look like when you hex dump that?

Let's suppose the html file has a literal "0x9f" character ("capital letter Y with diaeresis" in the Windows CP1252 encoding). Let's also suppose that you actually want this converted to the utf8 encoding for this letter:


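use Encode qw(from_to);   # from_to is not exported by default

# ... read the raw bytes of the html file into $html, and then:

from_to( $html, "cp1252", "utf8" );

# $html has been converted in place:
# it now contains utf8 data instead of cp1252 data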

Another way to do that, without calling Encode yourself, is to put an encoding layer on the input file handle:


open( my $html_fh, "<:encoding(cp1252)", $filename )
    or die "Cannot open $filename: $!";

# now text will be decoded from cp1252 as it is read from
# the file, and stored internally as utf8 characters.
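To see what the encoding layer does with that 0x9f byte, here is a minimal sketch that reads from an in-memory "file" instead of a real html file. The variable names are just for illustration, and I'm assuming the input really is cp1252:

```perl
use strict;
use warnings;

# Stand-in for the html file: a single cp1252 byte, 0x9f.
my $cp1252_bytes = "\x9f";

# Open the scalar as a file, decoding from cp1252 on input.
open( my $in, '<:encoding(cp1252)', \$cp1252_bytes )
    or die "Cannot open in-memory file: $!";
my $char = <$in>;
close($in);

# The byte has become one character: capital Y with diaeresis.
printf "U+%04X\n", ord($char);    # U+0178
```

If you then print $char to a utf8-mode output handle, it should come out as the proper two-byte utf8 sequence for that letter.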

If you are using a utf8 text editor to create your scripts, and you try to put literal wide characters within quoted strings in your script, you'll want to say "use utf8;" next to "use strict;", so that the perl interpreter will know that the script itself contains utf8 wide characters. That way, as your quoted strings are assigned to variables, those variables will have their "utf8 flag" set. This is important when you set an output file handle to utf8 mode: scalars with the utf8 flag will be output correctly as utf8 data.
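A small sketch of that, assuming the script file itself is saved as utf8. It writes to an in-memory buffer (my own stand-in for a real output file) so the resulting bytes can be inspected:

```perl
use strict;
use warnings;
use utf8;    # the source of this script is utf8

# A literal wide character in a quoted string.
# With "use utf8;" this is one character, not two bytes.
my $text = "Ÿ";

# Write it through a utf8-mode file handle into a buffer.
my $buffer = '';
open( my $out, '>:encoding(UTF-8)', \$buffer )
    or die "Cannot open in-memory file: $!";
print {$out} $text;
close($out);

# Hex-dump the bytes that were written.
print join( ' ', map { sprintf '%02x', ord($_) } split //, $buffer ), "\n";    # c5 b8
```

Without the "use utf8;" line, $text would hold the two raw bytes from the script file, and the output layer would encode each of them again.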

If a scalar contains some bytes with the 8th bit set, but the utf8 flag is not set, printing the string to a utf8-mode file handle will cause those bytes to be interpreted as "Latin-1" single-byte characters, and they will be "promoted" to utf8 wide characters -- e.g. 0x9f becomes the two-byte sequence "c2 9f"; another example: the two-byte sequence "c2 9f" becomes the four-byte sequence "c3 82 c2 9f". (Look at perldoc perlunicode, and find the section titled "Unicode Encodings" to see the reasoning behind that).
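You can reproduce that promotion without any files. This is only a sketch: it uses Encode's encode() to stand in for what a utf8-mode output handle does, and hex_bytes is a throwaway helper of mine, not something from a module:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show the bytes of a string as hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# One byte with the 8th bit set and no utf8 flag;
# perl treats it as the Latin-1 character U+009F.
my $raw = "\x9f";

# First pass: the byte is promoted to a two-byte utf8 sequence.
my $once = encode( 'UTF-8', $raw );

# Second pass: the two resulting bytes are each treated as
# Latin-1 characters and promoted again.
my $twice = encode( 'UTF-8', $once );

print hex_bytes($raw),   "\n";    # 9f
print hex_bytes($once),  "\n";    # c2 9f
print hex_bytes($twice), "\n";    # c3 82 c2 9f
```

So if a hex dump shows "c2 9f" where you expected a single character, it is worth checking whether undecoded bytes are being written through a utf8 layer somewhere.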

In reply to Re: Representing "binary" character in code? by graff
in thread Representing "binary" character in code? by robinbowes
