MattLG has asked for the wisdom of the Perl Monks concerning the following question:

I've discovered another problem with UTF8 that I don't understand. I'm still using Perl 5.8 (sorry, my ISP still hasn't upgraded). I'm basically just using LWP::Simple to read in some UTF8, process it and then output it. But it's coming out as gobbledegook again.

Here is the minimal code that I can reproduce the problem with:

#!/usr/bin/perl -CS
use LWP::Simple;
print get(...);

I'm using http://feeds.feedburner.com/breakingtravelnews as the source of UTF8 data.

If I don't use -CS, it works fine, but removing it breaks other parts of the code that need the switch.

I could understand the problem if I were using only -CI or -CO instead of -CS (i.e. both input and output). How can reading UTF8 in and then writing it straight back out as UTF8 not work?

Cheers

MattLG

Re: UTF8 and LWP::Simple;
by moritz (Cardinal) on Apr 25, 2010 at 16:09 UTC
    -CS tells Perl to decode characters coming from STDIN and encode characters going to STDOUT and STDERR, but LWP::Simple retrieves its data via a socket, not STDIN.

    The solution is to use LWP::UserAgent and then print $response->decoded_content.
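
    A minimal sketch of that suggestion, not a definitive version. Assumptions that are not in the post above: the helper name body_as_text is hypothetical, the hand-built HTTP::Response only exists so the sketch runs without a network connection, and binmode stands in for the output half of -CS so the script behaves the same however it is launched. Pass a URL as the first argument to fetch a real document instead of the sample.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

# Encode everything we print as UTF-8 (stand-in for the output half of -CS).
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: return the body as Perl characters,
# or die if decoded_content could not decode it.
sub body_as_text {
    my ($response) = @_;
    my $text = $response->decoded_content;
    die "could not decode response body\n" unless defined $text;
    return $text;
}

# Hand-built response so the sketch runs offline:
# "caf" + e-acute (UTF-8 bytes C3 A9) + newline.
my $sample = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain; charset=UTF-8' ],
    "caf\xC3\xA9\n",
);

# With a URL argument, fetch for real; otherwise use the sample.
my $response = @ARGV ? LWP::UserAgent->new->get($ARGV[0]) : $sample;
die $response->status_line, "\n" unless $response->is_success;

my $raw  = $response->content;        # bytes, exactly as the server sent them
my $text = body_as_text($response);   # characters, decoded per Content-Type

print $text;                          # encoded once, by the STDOUT layer
```

    For the sample, $raw is 6 bytes long and $text is 5 characters long; that difference is the decoding step LWP::Simple's get never performs.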

      D'oh! Of course.

      Thanks

      MattLG

Re: UTF8 and LWP::Simple;
by ikegami (Patriarch) on Apr 25, 2010 at 16:15 UTC

    get returns the document exactly as the web server sent it, as bytes. It doesn't decode the document into text, even if it happens to be text. By using -C, you are encoding text that is already encoded.

    Either

    • decode the document (->decoded_content if you were using LWP::UserAgent) and re-encode via -C (which allows you to convert from the remote encoding to your local encoding), or
    • binmode STDOUT (to disable -C and leave the document unchanged)
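
    The double encoding and the two options above can be illustrated offline with the core Encode module. This is only a sketch: the two-byte sample string is an assumed stand-in for what get would return, and the explicit encode calls imitate what a UTF-8 output layer does to the data on its way out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as a server would send them: e-acute (U+00E9) encoded as UTF-8.
my $bytes = "\xC3\xA9";

# The problem: undecoded bytes pass through a UTF-8 encoding step, so each
# byte is treated as a character and encoded again.
my $double = encode('UTF-8', $bytes);    # 4 bytes: C3 83 C2 A9

# Option 1: decode first, then encode exactly once on output.
my $chars  = decode('UTF-8', $bytes);    # 1 character
my $single = encode('UTF-8', $chars);    # back to the original 2 bytes

# Option 2: leave the bytes alone and print through a raw handle.
binmode STDOUT;                          # removes any :utf8 layer set by -C
print $bytes, "\n";
```

    Option 1 is the one to pick when the text is processed between input and output, since string functions then see characters rather than bytes; option 2 only suits passing the document through untouched.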