in reply to Encoding issue

(I wonder how many different anonymonks have been posting in this thread?)

When the character "lower case o with acute accent" is encoded in iso-8859-1 or cp1252, it is the single byte 0xF3.

The Unicode code point for that letter is U+00F3 (0x00F3), and when this is encoded in utf8, it is a 2-byte sequence: 0xC3 0xB3.

(In case you want to understand how 0x00F3 relates to the utf8 two-byte 0xC3 0xB3, look at the perlunicode man page, way down in the section titled "Unicode Encodings".)
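
If you would rather see the arithmetic than read the man page, here is a minimal sketch. It assumes the standard two-byte utf8 layout (110xxxxx 10xxxxxx) for code points 0x80 through 0x7FF, and the variable names are just mine for illustration. It builds the two bytes by hand and then checks them against what Encode produces:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $cp = 0xF3;    # code point of "lower case o with acute accent"

# Two-byte utf8 form: 110xxxxx 10xxxxxx, where the x positions are
# filled with the bits of the code point.
my $byte1 = 0xC0 | ( $cp >> 6 );      # lead byte carries the high bits
my $byte2 = 0x80 | ( $cp & 0x3F );    # continuation byte carries the low 6 bits
my $by_hand = sprintf "%02X %02X", $byte1, $byte2;

# The same character run through Encode, in utf8 and in iso-8859-1.
my $utf8   = join " ", map { sprintf "%02X", $_ } unpack( "C*", encode( 'UTF-8',      chr($cp) ) );
my $latin1 = join " ", map { sprintf "%02X", $_ } unpack( "C*", encode( 'iso-8859-1', chr($cp) ) );

print "by hand: $by_hand\n";    # C3 B3
print "utf8:    $utf8\n";       # C3 B3
print "latin1:  $latin1\n";     # F3
```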

When that utf8 2-byte sequence is displayed by any tool that uses iso-8859-1 or cp1252, instead of showing "lower case o with acute accent", you will see the two characters "upper case A with tilde" followed by "superscript 3", because in those single-byte encodings, those are the characters assigned to the code points 0xC3 and 0xB3, respectively. In other words, the two utf8 bytes are being misinterpreted as separate single-byte code points.
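
The misreading described above can be reproduced in a few lines. This is only a sketch: it decodes the same two bytes once as utf8 and once as cp1252, and prints the resulting code points rather than the characters themselves, so the output does not depend on how your terminal is set up.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The utf8 bytes for "lower case o with acute accent": 0xC3 0xB3.
my $bytes = encode( 'UTF-8', "\x{F3}" );

# Correct interpretation: the two bytes form one character.
my $right = decode( 'UTF-8', $bytes );

# Wrong interpretation: each byte is taken as its own cp1252 character.
my $wrong = decode( 'cp1252', $bytes );

my $right_cps = join " ", map { sprintf "U+%04X", ord($_) } split //, $right;
my $wrong_cps = join " ", map { sprintf "U+%04X", ord($_) } split //, $wrong;

print "as utf8:   $right_cps\n";    # U+00F3
print "as cp1252: $wrong_cps\n";    # U+00C3 U+00B3
```

U+00C3 is "upper case A with tilde" and U+00B3 is "superscript 3", which is the pair of characters you see when the display tool guesses wrong.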

To see the text displayed properly, use a display tool that knows how to handle utf8 encoding. If you would rather have your text data stored in a single-byte encoding, you can try something like this:

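#!/usr/bin/perl
use LWP::Simple;
use Encode 'from_to';

# fetch the page into $_
$_ = get("http://www.example.com");

# convert the data in place from utf8 to cp1252
from_to( $_, 'utf8', 'cp1252' );

# write the converted data to a file
open(O,">","test.txt") or die "open failed on test.txt: $!";
print O;
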
But be aware that your web page might come with utf8 characters that fall outside the range of your chosen single-byte code page. When that happens, your output file will lose some data, because each character that cannot be converted will be replaced by a question mark character.
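
To make that data loss concrete, here is a small sketch. It calls encode() directly on a character string instead of using from_to(), and the sample text (with U+4E2D, a character that cp1252 cannot represent) is made up for the illustration. It also shows one way to have the conversion fail loudly instead of silently substituting, using Encode's FB_CROAK check value:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00F3 exists in cp1252; U+4E2D (a CJK character) does not.
my $text = "Lim\x{F3}n \x{4E2D}";

# With the default check value, a character that cannot be mapped is
# replaced by the encoding's substitution character, "?" (0x3F).
my $lossy     = encode( 'cp1252', $text );
my $lossy_hex = join " ", map { sprintf "%02X", $_ } unpack( "C*", $lossy );
print "$lossy_hex\n";    # 4C 69 6D F3 6E 20 3F

# With FB_CROAK, the same conversion dies instead, so the loss gets noticed.
# LEAVE_SRC keeps encode() from modifying $text.
my $ok = eval { encode( 'cp1252', $text, Encode::FB_CROAK | Encode::LEAVE_SRC ); 1 };
print "strict conversion ", ( $ok ? "succeeded" : "failed" ), "\n";
```

Whether you want the quiet substitution or the hard failure depends on how much you care about the characters that don't fit.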

Re^2: Encoding issue
by Anonymous Monk on Apr 01, 2010 at 04:51 UTC
    Pasting the character into a file on Windows looks like this:
    Limón
    But in the Linux vi editor, it looks like this:
    Limón
      That is exactly what I would expect. The fact that one text editor or display tool shows the text the way you expect, and vi shows it some other way, doesn't alter the data itself in any way. It's the same sequence of bytes in each case, but the non-ASCII bytes are just being interpreted in two different ways.
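
If you want to check what is really in the data without trusting any editor, dump the bytes in hex. A minimal sketch follows; hex_bytes is just a helper name I made up, and the two strings are written out as explicit bytes to stand in for the word "Limón" stored as utf8 and as cp1252:

```perl
use strict;
use warnings;

# Hypothetical helper: render a byte string as space-separated hex values.
sub hex_bytes {
    my ($bytes) = @_;
    return join " ", map { sprintf "%02X", $_ } unpack( "C*", $bytes );
}

# The same word stored in the two encodings, given as raw bytes.
my $as_utf8   = "Lim\xC3\xB3n";
my $as_cp1252 = "Lim\xF3n";

print "utf8:   ", hex_bytes($as_utf8),   "\n";    # 4C 69 6D C3 B3 6E
print "cp1252: ", hex_bytes($as_cp1252), "\n";    # 4C 69 6D F3 6E
```

For a real file, read it in raw mode (binmode on the file handle) and pass the contents to the same helper. If you see C3 B3 where the accented letter should be, the file is utf8 no matter what any particular editor shows.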

      Check your docs for vi to see whether it offers any method for treating data as utf8-encoded. If it doesn't, you'll just need to accept the fact that "wide" (multi-byte) utf8 characters in your text file will show up as multiple single-byte characters (in the non-ASCII 0x80-0xff range) when you look at them in vi.