UTF8 and LWP::Simple;

MattLG has asked for the wisdom of the Perl Monks concerning the following question:

I've discovered another problem with UTF8 that I don't understand. I'm still using Perl 5.8 (sorry, my ISP still hasn't upgraded). I'm basically just using LWP::Simple to read in some UTF8, process it and then output it. But it's coming out as gobbledegook again.

Here is the minimal code that I can reproduce the problem with:

#!/usr/bin/perl -CS
use LWP::Simple;
print get(...);

I'm using http://feeds.feedburner.com/breakingtravelnews as the source of UTF8 data.

If I don't use -CS, it works fine, but then this breaks other things in the code that need these switches.

I could understand the problem if I was just using -CI or -CO, instead of -CS (i.e. both input and output). How does inputting UTF8 and then outputting it straight away as UTF8 not work?

Cheers

MattLG


Re: UTF8 and LWP::Simple; by moritz (Cardinal) on Apr 25, 2010 at 16:09 UTC
`-CS` tells Perl to decode characters coming from STDIN, and encode characters going to STDOUT - but LWP::Simple retrieves its information via a socket, not STDIN. The solution is to use LWP::UserAgent and then print `$response->decoded_content`.
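
A minimal sketch of that suggestion (not moritz's own code), assuming LWP::UserAgent is installed. The helper name `response_text` and the command-line handling are made up for illustration. It uses `binmode` to put a UTF-8 encoding layer on STDOUT instead of `-CS`; use one or the other, not both.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Encode characters as UTF-8 on output (stands in for the STDOUT half of -CS;
# do not combine this line with -CS or the output gets encoded twice).
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: turn a response into character data,
# or die with the HTTP status line.
sub response_text {
    my ($response) = @_;
    die 'Fetch failed: ' . $response->status_line . "\n"
        unless $response->is_success;
    # For textual content, decoded_content applies the charset named in the
    # Content-Type header, so the result is characters rather than raw bytes.
    return $response->decoded_content;
}

# Only fetch when a URL is given on the command line.
if (@ARGV) {
    my $ua = LWP::UserAgent->new(timeout => 30);
    print response_text($ua->get($ARGV[0]));
}
```

Run it with the feed URL from the question as the only argument. Because the string reaching `print` is already decoded, the output layer encodes it exactly once.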
Re^2: UTF8 and LWP::Simple; by MattLG (Beadle) on Apr 25, 2010 at 17:19 UTC
D'oh! Of course. Thanks

MattLG
Re: UTF8 and LWP::Simple; by ikegami (Patriarch) on Apr 25, 2010 at 16:15 UTC
`get` returns the document exactly as the web server sent it. It doesn't decode the document into text, even if it happens to be text. By using -C, you are encoding text that is already encoded. Either decode the document (`->decoded_content`, if you were using LWP::UserAgent) and let `-C` re-encode it (which allows you to convert from the remote encoding to your local encoding), or `binmode` STDOUT (to disable `-C` and leave the document unchanged).
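
To make "encoding encoded text" concrete, here is a small offline sketch using only the core Encode module. The sample string is an assumption chosen for illustration: the UTF-8 bytes of "café", standing in for what `get` would return.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for "café" (5 bytes), as an undecoded download would contain.
my $bytes = "caf\xC3\xA9";

# What an encoding output layer does to undecoded bytes: each byte is treated
# as one character and encoded again, so the 2-byte "é" grows to 4 bytes.
my $double = encode('UTF-8', $bytes);

# Decoding first yields 4 characters, which encode back to the original bytes.
my $text = decode('UTF-8', $bytes);
my $once = encode('UTF-8', $text);

printf "raw bytes: %d, double-encoded: %d, characters: %d, re-encoded: %d\n",
    length $bytes, length $double, length $text, length $once;
# prints "raw bytes: 5, double-encoded: 7, characters: 4, re-encoded: 5"
```

The 7-byte result is the gobbledegook from the original question; the decode-then-encode path round-trips cleanly.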