wmfs has asked for the wisdom of the Perl Monks concerning the following question:

I'm having a problem with LWP::UserAgent not correctly interpreting certain characters from an external HTML file. This is the code I am using:
use LWP::UserAgent;
$ua = LWP::UserAgent->new;
$url = "https://<SITE>/<FILE>.html";
$req = HTTP::Request->new(GET => $);
$ua->agent('Mozilla/5.0');
$res = $ua->request($req);
if ($res->is_success) {
    $result = $res->content;
} else {
    print $htmlres->status_line;
    print $htmlres->decoded_content;
}
Extract from the file on the internet
It is the coryphées should fear him,
Data retrieved
It is the coryph�es should fear him
Thank you for your help, Bill Seabrook UK

Re: LWP::UserAgent; unable to copy extended ASCII
by Corion (Patriarch) on Jun 06, 2022 at 09:49 UTC

    What you see is the Unicode replacement character that is printed whenever the target cannot accept the character. You should maybe also see some warning about "wide chars" from Perl.

    Did you tell Perl that you want to output Unicode to STDOUT?

    binmode STDOUT, ':encoding(UTF-8)';

      Nitpick: "Unicode" is the name of the concept, not of a particular encoding. The character é can be represented in a variety of encodings, including good ol' ISO-8859-1. So... Perl doesn't complain about a wide character in this case.

      The remedy is correct as given: You want to print in the encoding which is understood by the print handle. This is indeed UTF-8 for "modern" Linux terminals.

      Thank you!
      binmode STDOUT, ':encoding(UTF-8)';
      solves my problem. I am very grateful. Bill S
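
The point made in this subthread, that é is one character with several possible byte encodings, can be illustrated with a short sketch. It uses only the core Encode module; the hex_bytes helper is a hypothetical name introduced for illustration, and the terminal is assumed to expect UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, U+00E9 (é), written as an escape so that the
# encoding of this source file does not matter.
my $char = "\x{E9}";

# The same character as bytes in two different encodings.
my $latin1 = encode('ISO-8859-1', $char);   # a single byte
my $utf8   = encode('UTF-8',      $char);   # two bytes

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Assumption: the terminal expects UTF-8, so encode output accordingly.
binmode STDOUT, ':encoding(UTF-8)';

print "ISO-8859-1: ", hex_bytes($latin1), "\n";
print "UTF-8:      ", hex_bytes($utf8),   "\n";
print "Character:  $char\n";
```

Without the binmode line, Perl would most likely write the single byte E9 to the terminal, which is not valid UTF-8 on its own; that would be consistent with the replacement character seen in the question.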
Re: LWP::UserAgent; unable to copy extended ASCII
by Discipulus (Canon) on Jun 06, 2022 at 09:49 UTC
    Hello wmfs,

    your code lacks the indispensable use strict; use warnings; and the variable names seem to change partway through the script..

    ..but anyway, LWP::UserAgent retrieves the é correctly, as you can see in the output of this short one-liner (mind the Windows double quotes):

    perl -MLWP::UserAgent -e "print LWP::UserAgent->new->get('http://perlmonks.org/index.pl?node_id=11144443')->decoded_content" | grep cory
    It is the <em>coryphées</em> should fear him,
    It is the <em>coryph&#65533;es</em> should fear him

    L*

    PS I have simplified the above code; same results, btw

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: LWP::UserAgent; unable to copy extended ASCII
by wmfs (Acolyte) on Jun 06, 2022 at 09:48 UTC
    Sorry, mis-typed line 3, it should read
    $req = HTTP::Request->new(GET => $url);
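
Pulling the thread's suggestions together (use strict and use warnings, consistent variable names, decoded_content, and an output encoding layer), a minimal sketch might look like the following. The page_text helper, the hand-built demo response, and the URL taken from the command line are assumptions for illustration; the original post does not give the real address.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

# Assumption: the terminal expects UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: return the body as Perl characters,
# or die with the HTTP status line on failure.
sub page_text {
    my ($res) = @_;
    die $res->status_line, "\n" unless $res->is_success;
    # decoded_content applies the charset declared in the response,
    # turning raw bytes into characters.
    return $res->decoded_content;
}

# Offline demonstration with a hand-built response whose body
# holds the UTF-8 bytes C3 A9 for é.
my $demo = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "It is the coryph\xC3\xA9es should fear him,\n",
);
print page_text($demo);

# Live fetch, only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/5.0');
    print page_text($ua->get($url));
}
```

Using decoded_content rather than content means the script works with characters regardless of the encoding the server used, and the binmode layer then encodes them once on the way out.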