jhanna has asked for the wisdom of the Perl Monks concerning the following question:

I have a Perl CGI script that uses LWP to fetch a piece of a web page and show it on the current web page. Some characters, notably ' " and --, get mysteriously garbled.

Specifically " ' became GÇ£GÇÿ

Is this a unicode problem?

Obviously I can use s/// to fix the most common cases, but I'd like to get it right, because I may need to support non-English text as well.

Thanks for your suggestions!

Re: LWP gives funky characters
by ikegami (Patriarch) on Jan 24, 2007 at 23:35 UTC

    Two possibilities.

    • The page you are getting is formatted in the same encoding as the one you are generating, but you haven't told the browser which encoding this is.

      You'll need to do

      print $cgi->header(-type=>'text/html', -charset=>'UTF-8');

      The above sends the HTTP header Content-Type: text/html; charset=UTF-8, which tells the browser the same thing the following META element would in your HTML document.

      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    • The page you are getting is formatted using one encoding, and yours is formatted in another.

      Say the type of the page you are downloading is 'text/html; charset=UTF-8'.
      Say the type of the page you are generating is 'text/html; charset=iso-latin-1'.
      You'll need to do

      use Encode qw( decode encode );

      print $cgi->header(-type=>'text/html', -charset=>'iso-latin-1');

      my $utf8_html_from_src = ...
      my $html_from_src = decode('UTF-8', $utf8_html_from_src);        # bytes -> characters
      my $html_to_send = process($html_from_src);
      my $latin_html_to_send = encode('iso-latin-1', $html_to_send);   # characters -> bytes
      print($latin_html_to_send);

      In this example, the encoding used by the source can represent more characters than the encoding used to deliver the content. Characters that the delivery encoding cannot represent may appear as question marks, because encode() substitutes a placeholder for them. Avoiding that loss takes more work.
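      One way to avoid the question marks, offered only as a sketch, is to pass Encode's FB_HTMLCREF fallback so that unrepresentable characters become numeric character references. This only makes sense when the output is HTML text; the sample string and variable names here are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical sample: UTF-8 bytes for a curly-quoted word with an accent.
my $utf8_bytes = "\xE2\x80\x9Ccaf\xC3\xA9\xE2\x80\x9D";
my $text       = decode('UTF-8', $utf8_bytes);

# Default behaviour: characters outside Latin-1 are replaced by a
# substitution character, so the curly quotes are lost.
my $lossy = encode('iso-8859-1', $text);

# With FB_HTMLCREF the curly quotes survive as &#NNNN; references.
# LEAVE_SRC keeps encode() from modifying $text in place.
my $safe = encode('iso-8859-1', $text,
                  Encode::FB_HTMLCREF() | Encode::LEAVE_SRC());

print "$safe\n";
```

      The accented e is representable in Latin-1 and is written as a single byte either way; only the characters that do not fit are turned into references.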

      require LWP::UserAgent;
      my $ua = LWP::UserAgent->new;
      my $response = $ua->get($ARGV[0]);
      $_ = $response->content;

      Then I take the resulting page, find the DIV of interest, and extract the text. That text goes in the database. Later another page selects the text and puts it in a textarea.

      The page is coming in as text/html; charset=UTF-8, and my destination page is also text/html; charset=UTF-8.

      I'll look at Encode / decode('UTF-8'... that might be exactly what I need.

      Thanks.

        Then it's the first possibility. You're missing:

        print $cgi->header(-type=>'text/html', -charset=>'UTF-8');

        You probably have

        print $cgi->header(-type=>'text/html');

        which is the same as

        print $cgi->header(-type=>'text/html', -charset=>'ISO-8859-1');

        That tells the browser you are using one character set when you are using another.
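        To make the "declared charset must match the bytes" point concrete, here is a minimal sketch that builds a response by hand instead of going through $cgi->header. The function name utf8_response is hypothetical, and the hand-written header line is only a stand-in for what the CGI call produces.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Build a CGI response whose declared charset matches its actual bytes.
# $text is expected to hold decoded characters, not raw bytes.
sub utf8_response {
    my ($text) = @_;
    my $body = encode('UTF-8', $text);    # characters -> UTF-8 bytes
    return "Content-Type: text/html; charset=UTF-8\r\n"
         . "Content-Length: " . length($body) . "\r\n"
         . "\r\n"
         . $body;
}

binmode(STDOUT);                          # the response is already bytes
print utf8_response("<p>\x{201C}quoted\x{201D}</p>\n");
```

        If the header said ISO-8859-1 while the body stayed UTF-8, the browser would decode each multi-byte character as several separate ones, which is the garbling described in the question.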

Re: LWP gives funky characters
by graff (Chancellor) on Jan 24, 2007 at 23:24 UTC
    It certainly sounds like a problem in which unicode is involved... but it might be better to call it a "mismatched encodings" problem. Whatever page is being pulled in by LWP, it is apparently using some sort of "smart" or "wide-character" variants for the quotes and dashes, and in order to do that, the page should be labeled as to the particular (non-ASCII) character encoding that it is using in order to represent these special characters.

    Meanwhile, your own "current" web page is probably specifying a different character encoding, and/or you are viewing the page with a browser that is forcing its display to use some particular encoding, and the result is a conflict (a mismatch) with the original data received via LWP, so you are seeing what happens when the characters are misinterpreted.
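    The mismatch can be reproduced in a few lines. This sketch encodes a curly double quote and a curly single quote as UTF-8 and then decodes the same bytes with the wrong encoding. CP437 is chosen here only as an assumption: its result resembles the GÇ£GÇÿ in the question (with Γ flattened to G), but the thread does not confirm which encoding was actually involved.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Left double quote and left single quote, as characters.
my $quotes = "\x{201C}\x{2018}";

# What travels over the wire: six bytes of UTF-8.
my $bytes = encode('UTF-8', $quotes);

# Reading those bytes with the right and the wrong encoding.
my $correct = decode('UTF-8', $bytes);    # two characters again
my $misread = decode('cp437', $bytes);    # six unrelated characters

# Print code points rather than the characters themselves, so the
# output does not depend on the terminal's encoding.
printf "bytes:   %s\n", join ' ', map { sprintf '%02X', ord } split //, $bytes;
printf "correct: %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $correct;
printf "misread: %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $misread;
```

    Nothing is corrupted in transit here; the same bytes simply mean different things under different encodings.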

    It's also possible that your script may be doing certain "standard" operations on the data, via CPAN modules or your own code, and in the process, perl is doing some sort of "default, assumed-to-be-reasonable" conversion of the character encoding, again with the result that the special characters are being misinterpreted as something that they were not meant to be.
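    One such silent conversion, shown as a sketch with made-up sample data: when undecoded UTF-8 bytes are concatenated with a string that holds real characters, Perl treats each byte as a Latin-1 character, and encoding the result again produces double-encoded output.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical input: "caf" plus e-acute, still as raw UTF-8 bytes.
my $utf8_bytes = "caf\xC3\xA9";

# A string that already holds a decoded character (left double quote).
my $curly = "\x{201C}";

# Mixing the two: the bytes C3 A9 are treated as two Latin-1 characters,
# so encoding the combined string encodes them a second time.
my $double_encoded = encode('UTF-8', $utf8_bytes . $curly);

# Decoding the input first keeps everything in characters until output.
my $fixed = encode('UTF-8', decode('UTF-8', $utf8_bytes) . $curly);

printf "double: %s\n", join ' ', map { sprintf '%02X', ord } split //, $double_encoded;
printf "fixed:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $fixed;
```

    The usual remedy is to decode every input as soon as it arrives and encode only once, at the point of output.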

    If you can show the original url, or some of the relevant unmodified strings from that page, and/or some minimal snippet of your own code that produces this behavior, it would be more likely that we could pinpoint the issue(s) for you.

Re: LWP gives funky characters
by jhanna (Scribe) on Feb 02, 2007 at 04:46 UTC
    Turns out the solution for me was this:

    use Encode;
    require LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my $response = $ua->get("$ARGV[0]");
    $_ = $response->content;
    $_ = decode('utf-8', $_);                 # raw UTF-8 bytes -> Perl characters

    $result = "";
    for $r (/<div class="result-text-style-normal">\s*(.*?)<\/div>/isg,
            /<td\s+class\s*=\s*"?multipassage-box[^>]*>(.*?)<\/td>/isg) {
        #($r)=/<div class="result-text-style-normal">\s*(.*?)<\/div>/is;
        $r =~ s/<sup>.*?<\/sup>//gi;          # drop superscripts
        $r =~ s/<h[45]>.*?<\/h[45]>/ /gi;     # drop h4/h5 headings
        $r =~ s/<(.*?)>//g;                   # strip remaining tags
        $r =~ s/&nbsp;/ /gi;
        $r =~ s/\s*(.*?)\s*$/$1/s;            # trim surrounding whitespace
        $r =~ s/\s+/ /g;                      # collapse runs of whitespace
        # replace non-ASCII characters with JavaScript \uXXXX escapes
        $r =~ s/([\x{0080}-\x{ffff}])/'\\u'.sprintf('%04x',ord($1))/ge;
        $result .= ($result ? ' ' : '') . $r;
    }
    The main points being: (1) I needed to use Encode and call decode('utf-8',$_) on my response content, and (2) I replaced non-ASCII characters with their JavaScript \uXXXX escapes. This way I can completely avoid Unicode in the database and all encoding transmission issues, and just un-quote them in the AJAX client.
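    The escaping step can be isolated and checked with a round trip. In this sketch the names js_escape and js_unescape are hypothetical, and js_unescape is only a Perl stand-in for the un-quoting that happens in the JavaScript client in the setup described above.

```perl
use strict;
use warnings;

# Replace every character from U+0080 to U+FFFF with a \uXXXX escape,
# using the same substitution range as the code in the post.
sub js_escape {
    my ($text) = @_;
    $text =~ s/([\x{0080}-\x{FFFF}])/sprintf('\\u%04x', ord($1))/ge;
    return $text;
}

# Reverse the escaping: turn each \uXXXX back into its character.
sub js_unescape {
    my ($escaped) = @_;
    $escaped =~ s/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge;
    return $escaped;
}

my $original = "\x{201C}caf\x{E9}\x{201D}";
my $escaped  = js_escape($original);
print "$escaped\n";    # pure ASCII, safe to print anywhere
```

    Two limits of this approach are worth keeping in mind: characters above U+FFFF are left untouched by the range, and source text that already contains a literal backslash followed by u and four hex digits would be changed when un-escaped.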