Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Suppose this text is present on a web page:
Limón
When I download the contents to a file using the Perl LWP module,
#!/usr/bin/perl
use LWP::Simple;
$status = getstore("http://www.example.com","test.txt");
test.txt then contains the characters LimÃ³n. How can I avoid the characters being converted this way?

Replies are listed 'Best First'.
Re: Encoding issue
by graff (Chancellor) on Mar 30, 2010 at 12:34 UTC
    (I wonder how many different anonymonks have been posting in this thread?)

    When the character "lower case o with acute accent" is encoded in iso-8859-1 or cp1252, it is the single byte 0xF3.

    The Unicode code point for that letter is 0x00F3 (U+00F3), and when this is encoded in utf8, it is a 2-byte sequence: 0xC3 0xB3.

    (In case you want to understand how 0x00F3 relates to the utf8 two-byte 0xC3 0xB3, look at the perlunicode man page, way down in the section titled "Unicode Encodings".)
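    As a rough illustration of that relationship, the sketch below splits the 11 significant bits of a code point into the two-byte utf8 layout (110xxxxx 10xxxxxx) and compares the result with what the Encode module produces. The helper name utf8_two_bytes is made up for this example and only covers the two-byte range U+0080 to U+07FF.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hand-encode a code point in the range U+0080..U+07FF as two utf8 bytes.
# (utf8_two_bytes is a hypothetical helper name, for illustration only.)
sub utf8_two_bytes {
    my ($cp) = @_;
    die "code point outside the two-byte range" if $cp < 0x80 || $cp > 0x7FF;
    my $lead  = 0xC0 | ($cp >> 6);      # 110xxxxx: upper 5 of the 11 bits
    my $trail = 0x80 | ($cp & 0x3F);    # 10xxxxxx: lower 6 bits
    return ($lead, $trail);
}

my @manual = utf8_two_bytes(0x00F3);
my @encode = unpack 'C*', encode('UTF-8', "\x{F3}");

printf "manual: %s\n", join ' ', map { sprintf '0x%02X', $_ } @manual;  # manual: 0xC3 0xB3
printf "Encode: %s\n", join ' ', map { sprintf '0x%02X', $_ } @encode;  # Encode: 0xC3 0xB3
```

    For 0x00F3 (binary 000 1111 0011), the upper five bits are 00011 and the lower six are 110011, which gives 0xC3 and 0xB3.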

    When that utf8 2-byte sequence is displayed by any tool that uses iso-8859-1 or cp1252, instead of showing "lower case o with acute accent", you will see the two characters "upper case A with tilde" followed by "superscript 3", because in those single-byte encodings, those are the characters assigned to the code points 0xC3 and 0xB3, respectively. In other words, the two utf8 bytes are being misinterpreted as separate single-byte code points.
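    The misinterpretation can be reproduced in a few lines. This is only a sketch: it assumes the terminal showing the output understands utf8, and it uses Encode to play the role of the two kinds of display tool.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "Lim\x{F3}n";              # "Limón" as Perl characters
my $bytes = encode('UTF-8', $text);    # bytes 4C 69 6D C3 B3 6E

# Correct reading: interpret the bytes as utf8.
my $right = decode('UTF-8', $bytes);

# Wrong reading: interpret every byte as one cp1252 character.
my $wrong = decode('cp1252', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "read as utf8:   $right\n";      # Limón
print "read as cp1252: $wrong\n";      # LimÃ³n
```

    The bytes are identical in both cases; only the interpretation differs.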

    To see the text displayed properly, use a display tool that knows how to handle utf8 encoding. If you would rather have your text data stored in a single-byte encoding, you can try something like this:

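    #!/usr/bin/perl
    use LWP::Simple;
    use Encode 'from_to';

    # Fetch the page (assumed here to arrive as utf8-encoded bytes).
    $_ = get("http://www.example.com");
    die "fetch failed" unless defined $_;
    # Convert the bytes in place from utf8 to cp1252.
    from_to( $_, 'utf8', 'cp1252' );
    open(O,">","test.txt") or die "open failed on test.txt: $!";
    print O;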
    But be aware that your web page might come with utf8 characters that fall outside the range of your chosen single-byte code page. When that happens, your output file will lose some data, because each character that cannot be converted will be replaced by a question mark character.
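    A small sketch of that data loss, using Encode directly rather than from_to. The sample string and the choice of U+4E2D as the unmappable character are assumptions made for the example. Passing Encode::FB_CROAK as the third argument is one way to have the conversion die instead of silently substituting.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "Limón" followed by U+4E2D, a character that cp1252 cannot represent.
my $text = "Lim\x{F3}n \x{4E2D}";

# Default behaviour: the unmappable character becomes "?" (0x3f).
my $lossy = encode('cp1252', $text);
print join(' ', unpack '(H2)*', $lossy), "\n";   # 4c 69 6d f3 6e 20 3f

# Strict behaviour: die on the first character that cannot be converted.
my $ok = eval {
    encode('cp1252', $text, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print $ok ? "converted cleanly\n" : "conversion failed: $@";
```

    Checking this way before writing the file at least makes the loss visible instead of leaving stray question marks in the output.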
      Pasting the character into a file on Windows looks like this:
      Limón
      But in the Linux vi editor, it looks like this:
      LimÃ³n
        That is exactly what I would expect. The fact that one text editor or display tool shows the text the way you expect, and vi shows it some other way, doesn't alter the data itself in any way. It's the same sequence of bytes in each case, but the non-ASCII bytes are just being interpreted in two different ways.

        Check your docs for vi to see whether it offers any method for treating data as utf8-encoded. If it doesn't, you'll just need to accept the fact that "wide" (multi-byte) utf8 characters in your text file will show up as multiple single-byte characters (in the non-ASCII 0x80-0xff range) when you look at them in vi.

Re: Encoding issue
by ikegami (Patriarch) on Apr 02, 2010 at 03:58 UTC

    getstore stores the file exactly as it was provided. If it's text, and if you want to convert the character encoding to what you use locally (i.e. decoding using the current encoding, then encoding using the desired encoding), it's up to you to do so.

    If the document in question is an HTML document, the encoding should appear in the Content-Type HTTP header. It might also appear in a META http-equiv element inside the document. If you convert the encoding, don't forget to adjust the META element.
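    The decode-then-encode step might look roughly like the sketch below. It uses only the core Encode module; the header value and body are stand-ins for what a real response would supply, both helper names are hypothetical, and the fallback charset is an assumption. With LWP::UserAgent, the response object's decoded_content method is intended to cover the decoding half.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: extract the charset from a Content-Type header value,
# falling back to a caller-supplied default when none is declared.
sub charset_from_content_type {
    my ($content_type, $default) = @_;
    return $1 if $content_type =~ /;\s*charset\s*=\s*"?([^\s";]+)/i;
    return $default;
}

# Hypothetical helper: decode with the source charset, re-encode for local use.
sub recode {
    my ($bytes, $from, $to) = @_;
    my $chars = decode($from, $bytes);   # bytes -> Perl characters
    return encode($to, $chars);          # Perl characters -> bytes
}

# Stand-ins for the header and body of a real HTTP response.
my $header = 'text/html; charset=UTF-8';
my $body   = "Lim\xC3\xB3n";             # "Limón" as utf8 bytes

my $charset = charset_from_content_type($header, 'iso-8859-1');
my $local   = recode($body, $charset, 'iso-8859-1');

print join(' ', unpack '(H2)*', $local), "\n";   # 4c 69 6d f3 6e
```

    Note that the decoding step cannot be skipped. Printing the undecoded bytes through a handle opened with ">:encoding(iso-8859-1)" treats each utf8 byte as a separate character, so the two bytes 0xC3 0xB3 end up in the file unchanged.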

Re: Encoding issue
by Anonymous Monk on Mar 30, 2010 at 09:13 UTC
    The test.txt has the characters LimÃ³n. How to avoid the characters converting to this way.

    There is no conversion. You need to view test.txt with something capable of interpreting its encoding, whether that is ASCII or UTF-8.

      LWP won't change the encoding, but your console might display weird stuff. For example, cmd.exe displays:
      $ chcp
      Active code page: 437
      $ echo Limón |od -tacx1
      0000000   L   i   m   "   n  sp  sp  cr  nl
                L   i   m   ó   n          \r  \n
               4c  69  6d  a2  6e  20  20  0d  0a
      0000011
      $ perl -le" print for @ARGV" Limón |od -tacx1
      0000000   L   i   m   s   n  cr  nl
                L   i   m   ≤   n  \r  \n
               4c  69  6d  f3  6e  0d  0a
      0000007
      $ perl -le" binmode STDOUT, ':encoding(cp437)'; print for @ARGV" Limón |od -tacx1
      0000000   L   i   m   "   n  cr  nl
                L   i   m   ó   n  \r  \n
               4c  69  6d  a2  6e  0d  0a
      0000007
      $ perl -le" binmode STDOUT, ':encoding(UTF-8)'; print for @ARGV" Limón |od -tacx1
      0000000   L   i   m   C   3   n  cr  nl
                L   i   m   ├   │   n  \r  \n
               4c  69  6d  c3  b3  6e  0d  0a
      0000010
      $
        I am opening the file like this:
        open (FILEHANDLE, ">:encoding(iso-8859-1)", "$files");
        But the problem is still not solved.