RipHard has asked for the wisdom of the Perl Monks concerning the following question:

I've got a program that downloads web pages using
LWP::Simple. However, special characters get messed up.

Code, excerpt:

use LWP::Simple;
$webObject = &get($URL);
print $webObject;

The target web-page contains the text: "æ, ø, å"
The result when printing out is: "æ, ø å";

Obviously some charset problems. How can I fix this?


Thanks...

Re: LWP::Simple // Special Character problems.
by ikegami (Patriarch) on May 24, 2007 at 02:21 UTC

    There are two key questions that need to be answered.

    • What's the encoding of the input data?
    • In what encoding should the output data be?

    In this case, that means

    • What's the encoding of the downloaded content?
    • What encoding does your terminal use?

    Taking some guesses, we end up with:

    use Encode qw( decode encode );
    use LWP::Simple qw( get );

    my $URL = ...;

    my $web_enc = 'UTF-8';
    my $out_enc = 'iso-latin-1';

    my $web_octets = get($URL);
    my $chars = decode($web_enc, $web_octets);
    my $out_octets = encode($out_enc, $chars);
    print($out_octets);

    Note: from_to could be used as a shortcut for decode plus encode.
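For illustration, here is a minimal sketch of the from_to shortcut mentioned above. It assumes the page is UTF-8 and the terminal is Latin-1, the same guesses as before. The helper name transcode_octets is hypothetical, and the hard-coded octets stand in for whatever get($URL) would return.

```perl
use strict;
use warnings;
use Encode qw( from_to );

# Hypothetical helper: convert a byte string from one encoding to another.
# from_to() modifies its first argument in place, so work on a copy.
sub transcode_octets {
    my ($octets, $from_enc, $to_enc) = @_;
    my $copy = $octets;
    from_to($copy, $from_enc, $to_enc);
    return $copy;
}

# "æ, ø, å" as UTF-8 octets (assumed stand-in for the downloaded content).
my $web_octets = "\xC3\xA6, \xC3\xB8, \xC3\xA5";
my $out_octets = transcode_octets($web_octets, 'UTF-8', 'iso-latin-1');

# Show the resulting bytes in hex rather than relying on the terminal.
print unpack('H*', $out_octets), "\n";   # e62c20f82c20e5
```

Each two-byte UTF-8 sequence collapses to the single Latin-1 byte for the same character, which is what a Latin-1 terminal expects.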

    Ok, I kinda lied. In this case, it's not necessary to know the encoding of the downloaded content because the web server should tell us what it is.

    use LWP::UserAgent qw( );

    my $URL = ...;

    my $out_enc = 'iso-latin-1';

    # Saves us from encoding chars sent to STDOUT.
    binmode(STDOUT, ":encoding($out_enc)");

    my $ua = LWP::UserAgent->new();
    my $response = $ua->get($URL);
    my $chars = $response->decoded_content(default_charset => 'UTF-8');
    print($chars);
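To see how decoded_content picks the charset from the server's header, here is a small offline sketch. It builds HTTP::Response objects by hand instead of fetching anything, so the Content-Type values and body octets are made up for the example, and decode_body is a hypothetical helper name.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: wrap raw octets in a response carrying the given
# Content-Type, then let decoded_content() choose the charset from it.
sub decode_body {
    my ($content_type, $octets) = @_;
    my $response = HTTP::Response->new(
        200, 'OK',
        [ 'Content-Type' => $content_type ],
        $octets,
    );
    return $response->decoded_content(default_charset => 'UTF-8');
}

# The same character "æ", sent in two different encodings.
my $from_utf8   = decode_body('text/html; charset=UTF-8',      "\xC3\xA6");
my $from_latin1 = decode_body('text/html; charset=ISO-8859-1', "\xE6");

print $from_utf8 eq $from_latin1 ? "same character\n" : "different\n";
```

Both calls should yield the same one-character string, because the declared charset (not a guess) drives the decoding.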

    Update: Non-core module Term::Encoding can help determine the value for $out_enc.
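A rough sketch of how Term::Encoding might be plugged in. This assumes the module provides term_encoding(), as in its documented synopsis. The helper guess_out_enc and the UTF-8 fallback are assumptions added here, for when the module is missing or reports an encoding that Encode does not know.

```perl
use strict;
use warnings;
use Encode qw( find_encoding );

# Hypothetical helper: ask Term::Encoding for the terminal's encoding if the
# module is installed; otherwise fall back to UTF-8 (an assumption).
sub guess_out_enc {
    my $enc = eval {
        require Term::Encoding;
        Term::Encoding::term_encoding();
    };
    # Only trust the answer if Encode recognizes the name.
    return ($enc && find_encoding($enc)) ? $enc : 'UTF-8';
}

my $out_enc = guess_out_enc();
binmode(STDOUT, ":encoding($out_enc)");
print "Output encoding: $out_enc\n";
```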

Re: LWP::Simple // Special Character problems.
by graff (Chancellor) on May 24, 2007 at 01:08 UTC
    You'll need to experiment with a few different things until you find the one that works... The first thing I'd probably try is:
    ...
    binmode STDOUT, ":utf8";
    print $webObject;
    Then of course you have to make sure you have an appropriate method for actually viewing the output -- a utf8-aware terminal window, a browser set to utf8 encoding, etc.

    If that doesn't work, and you don't know what else to try, reply back with some more details (an actual url, what you are using to view the data, etc).

    (update: sometimes it's hard to work out the logic of what is going wrong... another thing to try, if the above does not work, is:

    use Encode;
    ...
    $webObject = decode( "utf8", get( $URL ));
    ...
    If you try that without the "binmode" thing, you should get warnings about "wide character in print...", and doing binmode STDOUT,":utf8"; should make those warnings go away.)
      binmode STDOUT, ":utf8";
      print $webObject;

      would encode to UTF-8 content that appears to already be UTF-8 encoded. The idea is to convert *from* UTF-8, but you are converting *to* UTF-8. My reply shows some implementations.
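The double-encoding point can be checked without a terminal. This sketch assumes the downloaded octets are UTF-8 and compares encoding them a second time against decoding first. Calling encode on the raw octets is used here as a stand-in for what a :utf8 output layer would do to them.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# "æ" as UTF-8 octets, not yet decoded (assumed stand-in for get($URL)).
my $web_octets = "\xC3\xA6";

# Treating the octets as characters and encoding again doubles them up.
my $double = encode('UTF-8', $web_octets);

# Decoding first, then encoding, round-trips cleanly.
my $round = encode('UTF-8', decode('UTF-8', $web_octets));

print unpack('H*', $double), "\n";   # c383c2a6
print unpack('H*', $round),  "\n";   # c3a6
```

The four-byte result is the kind of garbage that shows up as two odd characters in place of each special character.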

Re: LWP::Simple // Special Character problems.
by andreas1234567 (Vicar) on May 24, 2007 at 07:31 UTC
    The target web-page contains the text: "æ, ø, å"
    That indicates that the web-page in question did not properly encode unsafe characters using HTML::Entities or equivalent, right? But that's probably not your fault, unless of course it's your own site. But you can decode it yourself:
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::Entities;

    my $str = decode_entities(get(q{http://www.uio.no}));
    my @arr = split('\s+', $str);
    for (@arr) {
        print if (m/[æøå]/i);
    }
    __END__
    største ønsker å søk--> <!--Søk søkeknapp alt="Søk" value="Søk" Walløe forskingsråd</a><span å nivået ...
    Update: Please disregard this post. It is wrong and misleading.

    Andreas
    --
      There's nothing unsafe about those characters. "A" is just as safe/unsafe. Your code *happens* to work in this specific case, but will not work for all encodings. It'll fail for UTF-16, for example.
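A small sketch of why the order matters for something like UTF-16. The sample string and the helper name page_text are made up for illustration. In UTF-16 every ASCII character is followed or preceded by a zero byte, so decode_entities cannot find "&oslash;" in the raw octets; decoding to characters first lets it work.

```perl
use strict;
use warnings;
use Encode qw( decode encode );
use HTML::Entities qw( decode_entities );

# Hypothetical helper: decode octets to characters first, then expand entities.
sub page_text {
    my ($octets, $enc) = @_;
    return decode_entities(decode($enc, $octets));
}

# "s&oslash;k" as UTF-16LE octets (assumed stand-in for a downloaded page).
my $octets = encode('UTF-16LE', 's&oslash;k');

# On the raw octets the entity is hidden between zero bytes, so nothing changes.
my $untouched = decode_entities(my $copy = $octets) eq $octets;
print "raw octets unchanged: ", ($untouched ? "yes" : "no"), "\n";

binmode(STDOUT, ':encoding(UTF-8)');
print page_text($octets, 'UTF-16LE'), "\n";
```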
        Hi. Thanks to you all for great feedback. I've gotten this working now. Cheers, Fro