When the character "lower case o with acute accent" is encoded in iso-8859-1 or cp1252, it is the single byte 0xF3.
The Unicode code point for that letter is 0x00F3, and when this is encoded in utf8, it becomes a 2-byte sequence: 0xC3 0xB3.
(In case you want to understand how 0x00F3 relates to the utf8 two-byte 0xC3 0xB3, look at the perlunicode man page, way down in the section titled "Unicode Encodings".)
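If it helps to see those bytes for yourself, here is a minimal sketch using the core Encode module. The helper name hex_bytes is made up for this illustration; the last step redoes the two-byte utf8 layout (110xxxxx 10xxxxxx) by hand so you can compare it with what Encode produces.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{F3}";    # U+00F3, lower case o with acute accent

# hex_bytes is a hypothetical helper: it formats each byte of a string as 0xNN
sub hex_bytes { join ' ', map { sprintf '0x%02X', ord } split //, $_[0] }

my $latin1 = hex_bytes( encode( 'iso-8859-1', $char ) );
my $cp1252 = hex_bytes( encode( 'cp1252',     $char ) );
my $utf8   = hex_bytes( encode( 'UTF-8',      $char ) );

# The same two bytes by hand: the top bits of the code point go into
# 110xxxxx, the low six bits go into 10xxxxxx.
my $cp     = 0xF3;
my $manual = sprintf '0x%02X 0x%02X', 0xC0 | ( $cp >> 6 ), 0x80 | ( $cp & 0x3F );

print "iso-8859-1: $latin1\n";
print "cp1252:     $cp1252\n";
print "utf8:       $utf8\n";
print "by hand:    $manual\n";
```

The first two lines should show the single byte, and the last two should agree with each other.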
When that utf8 2-byte sequence is displayed by any tool that uses iso-8859-1 or cp1252, instead of showing "lower case o with acute accent", you will see the two characters "upper case A with tilde" followed by "superscript 3", because in those single-byte encodings, those are the characters assigned to the code points 0xC3 and 0xB3, respectively. In other words, the two utf8 bytes are being misinterpreted as separate single-byte code points.
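You can reproduce that misreading with a short sketch like the one below. It decodes the same two bytes once as cp1252 and once as utf8, then looks up the Unicode names of the resulting characters. The sub name names_for is just something I made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

my $bytes = "\xC3\xB3";    # the utf8 encoding of U+00F3

# names_for is a hypothetical helper: it decodes $octets using $encoding
# and returns the Unicode names of the resulting characters.
sub names_for {
    my ( $encoding, $octets ) = @_;
    my $text = decode( $encoding, $octets );
    return join ' + ', map { charnames::viacode( ord $_ ) } split //, $text;
}

my $wrong = names_for( 'cp1252', $bytes );    # bytes read as two characters
my $right = names_for( 'UTF-8',  $bytes );    # bytes read as one character

print "as cp1252: $wrong\n";
print "as utf8:   $right\n";
```

Decoded as cp1252, the bytes come out as two separate characters; decoded as utf8, they come out as the single letter you wanted.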
To see the text displayed properly, use a display tool that knows how to handle utf8 encoding. If you would rather have your text data stored in a single-byte encoding, you can try something like this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;
    use Encode 'from_to';

    # get() returns undef if the fetch fails
    my $page = get("http://www.example.com");
    defined $page or die "could not fetch the page";

    # convert in place from utf8 to cp1252
    from_to( $page, 'utf8', 'cp1252' );

    open( my $out, ">", "test.txt" ) or die "open failed on test.txt: $!";
    print $out $page;

But be aware that your web page might come with utf8 characters that fall outside the range of your chosen single-byte code page. When that happens, your output file will lose some data, because each character that cannot be converted will be replaced by a question mark character.
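Here is a rough sketch of that data loss, plus one way to check for it before converting. The sample string (o-acute followed by a CJK character, U+4E2D) is made up for the illustration. Passing FB_CROAK to encode() makes it die on the first character that has no cp1252 equivalent, instead of quietly substituting a question mark.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode from_to FB_CROAK);

# Made-up sample: o-acute (fits in cp1252) plus U+4E2D (does not)
my $octets = encode( 'UTF-8', "\x{F3}\x{4E2D}" );

# Default behaviour: the unconvertible character becomes "?"
my $lossy = $octets;
from_to( $lossy, 'utf8', 'cp1252' );

# Strict check: encode() with FB_CROAK dies if anything cannot be converted
my $clean = eval {
    my $text = decode( 'UTF-8', $octets );
    encode( 'cp1252', $text, FB_CROAK );
    1;
};

print "lossy result: ", join( ' ', map { sprintf '0x%02X', ord } split //, $lossy ), "\n";
print $clean ? "converts cleanly\n" : "some characters would be lost\n";
```

If the strict check fails, you know the question marks in the output are replacements and not part of the original text.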
In reply to Re: Encoding issue
by graff
in thread Encoding issue
by Anonymous Monk