When you say you "hex-dumped the source file" (and saw "c2 9f"), were you talking about the original html file (so that "c2 9f" was the "bad" character)? Or were you talking about your perl script? If you were talking about your perl script (which is what I'm guessing), then what do the "bad" characters in the html file look like when you hex dump that?
Let's suppose the html file has a literal "0x9f" character ("capital letter Y with diaeresis" in the Windows CP1252 encoding). Let's also suppose that you actually want this converted to the utf8 encoding for this letter:

    use Encode;
    # ... read the html file into $html, and then:
    from_to( $html, "cp1252", "utf8" );
    # now $html contains utf8 data instead of cp1252 data

And another way to do that, without using Encode:

    open( HTML, "<:encoding(cp1252)", $filename ) or die "can't open $filename: $!";
    # now text will be converted from cp1252 to utf8
    # as it is read from the file.

If you are using a utf8 text editor to create your scripts, and you try to put literal wide characters within quoted strings in your script, you'll want to say "use utf8;" next to "use strict;", so that the perl interpreter will know that the script itself contains utf8 wide characters. That way, as your quoted strings are assigned to variables, those variables will have their "utf8 flag" set. This is important when you set an output file handle to utf8 mode: scalars with the utf8 flag will be output correctly as utf8 data.
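As a quick sanity check on the cp1252-to-utf8 conversion discussed above, here is a minimal sketch that decodes the single cp1252 byte 0x9f and re-encodes it as utf8, then prints the resulting bytes in hex. The helper name cp1252_bytes_to_utf8_hex is made up for this illustration; decode() and encode() are the standard Encode functions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the original post): takes raw cp1252 bytes
# and returns the utf8 encoding of the same text as a hex string.
sub cp1252_bytes_to_utf8_hex {
    my ($bytes) = @_;
    my $chars = decode( "cp1252", $bytes );    # bytes -> perl characters
    my $utf8  = encode( "UTF-8", $chars );     # characters -> utf8 bytes
    return join " ", map { sprintf "%02x", ord } split //, $utf8;
}

print cp1252_bytes_to_utf8_hex("\x9f"), "\n";
```

If a hex dump of your converted data shows "c5 b8" where the Y with diaeresis should be, the byte was treated as cp1252; "c2 9f" in that spot would mean it was treated as Latin-1 instead.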
If a scalar contains some bytes with the 8th bit set, but the utf8 flag is not set, printing the string to a utf8-mode file will cause those bytes to be interpreted as "Latin-1" single-byte characters, and they will be "promoted" to utf8 wide characters -- e.g. 0x9f becomes the two-byte sequence "c2 9f"; another example: the two-byte sequence "c2 9f" becomes the four-byte sequence "c3 82 c2 9f". (Look at perldoc perlunicode, and find the section titled "Unicode Encodings" to see the reasoning behind that).
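The promotion described above is easy to reproduce without touching a real file. The following is only a sketch: it prints to an in-memory file handle opened in utf8 mode and shows what bytes come out. The helper name hex_as_written_to_utf8_handle is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): prints a string to an
# in-memory file handle in utf8 mode and returns the bytes written, in hex.
sub hex_as_written_to_utf8_handle {
    my ($string) = @_;
    my $buffer = "";
    open( my $fh, ">:utf8", \$buffer ) or die "can't open in-memory handle: $!";
    print {$fh} $string;
    close($fh);
    return join " ", map { sprintf "%02x", ord } split //, $buffer;
}

print hex_as_written_to_utf8_handle("\x9f"), "\n";        # one byte, no utf8 flag
print hex_as_written_to_utf8_handle("\xc2\x9f"), "\n";    # two bytes, no utf8 flag
print hex_as_written_to_utf8_handle("\x{178}"), "\n";     # one wide character, utf8 flag set
```

The first two calls show the "c2 9f" and "c3 82 c2 9f" results mentioned above; the third shows that a scalar which already holds the wide character U+0178 is written as the proper utf8 bytes "c5 b8".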
In reply to Re: Representing "binary" character in code?
by graff
in thread Representing "binary" character in code?
by robinbowes