Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am retrieving news items posted on the presidential campaign web sites. I get a page with LWP::Simple, then I parse it with HTML::Parser to extract the text portions.

This works fine for Kerry's site, but when I get/parse a page from Bush's site, some of the characters (apostrophes, quotes, dashes, etc.) come back in an odd encoding. Most display as three-character sequences beginning with a-hat, as in a-hat euro-sign vertical bar. There are also some single-character things like a dotted cap-A, etc.

In the source code for the page, they seem to be normal characters or HTML tags. See example at http://www.georgewbush.com/News/Read.aspx?ID=2768

Does anyone know what these characters are and how to translate them into regular ASCII characters before I get them back from HTML::Parser?

Many thanks.... Steve

Re: LWP::Simple returns strange encodings
by Joost (Canon) on Jun 24, 2004 at 19:13 UTC
    See the content-type header:
    > HEAD http://www.georgewbush.com/News/Read.aspx?ID=2768
    200 OK
    Cache-Control: private
    Connection: close
    Date: Thu, 24 Jun 2004 19:09:36 GMT
    Server: Microsoft-IIS/5.0
    Content-Length: 55012
    Content-Type: text/html; charset=utf-8
    Client-Date: Thu, 24 Jun 2004 19:14:04 GMT
    Client-Peer: 65.172.163.121:80
    Client-Response-Num: 1
    Set-Cookie: ASP.NET_SessionId=kma4jj55oyrrn245ujknpdab; path=/
    X-AspNet-Version: 1.1.4322
    X-Powered-By: ASP.NET
    It's UTF-8 encoded, and translating it back to ASCII might be lossy. See perlunicode and Encode::PerlIO for some pointers on how to handle Unicode - you probably need a recent perl (at least 5.6, but preferably 5.8 or higher).

    Joost.
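For what it's worth, here is a minimal sketch of why the a-hat sequences show up and how decoding makes them go away. It assumes Perl 5.8 or later with the core Encode module. The sample bytes are a hand-written stand-in for what get() from LWP::Simple would hand back for a UTF-8 page, and the variable names ($bytes, $misread, $text) are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for raw page bytes: "It" + U+2019 (right single quote) + "s" in UTF-8
my $bytes = "It\xE2\x80\x99s";

# Reading each byte as a CP 1252 character gives a-circumflex, euro sign,
# trademark sign: the kind of three-character sequence described in the question
my $misread = decode('cp1252', $bytes);

# Decoding as UTF-8 collapses those three bytes into one character
my $text = decode('UTF-8', $bytes);

printf "misread: %d chars, decoded: %d chars\n", length($misread), length($text);
printf "U+%04X\n", ord(substr($text, 2, 1));
```

Run as is, this should report 6 characters for the misread string, 4 for the decoded one, and U+2019 for the quote. In the real script the decode('UTF-8', ...) call would sit between the fetch and the hand-off to HTML::Parser.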

      I am guessing the special characters are smart quotes and similar stuff. Those will definitely be lossy if converted to ASCII from UTF-8. He could translate to CP 1252, which has those characters in a non-standard location. The characters probably came from that encoding in Microsoft products.
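A rough sketch of the CP 1252 route, assuming the core Encode module is available (Perl 5.8+). The input is a hand-written sample, not text from the actual page, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hand-written sample: left and right double quotes (U+201C, U+201D) around "hi", in UTF-8
my $bytes = "\xE2\x80\x9Chi\xE2\x80\x9D";

my $text   = decode('UTF-8', $bytes);    # 4 characters
my $cp1252 = encode('cp1252', $text);    # 4 bytes, one per character

# Show the resulting bytes in hex
print join(' ', map { sprintf '%02X', ord($_) } split //, $cp1252), "\n";   # 93 68 69 94
```

The smart quotes end up as the single bytes 0x93 and 0x94, which only display correctly where the output is known to be CP 1252.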

      Alternatively, he could replace the characters with ASCII equivalents. Smart quotes with normal quotes, en-dashes with hyphens, etc. This would be similar to the conversion that people do with the CP 1252 but using Unicode characters instead.
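The replacement idea could look something like the sketch below. asciify() is a hypothetical helper name, the list of characters is only a starting point (quotes, dashes, ellipsis), and the input must already be decoded from UTF-8 into Perl characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: swap common typographic characters for ASCII stand-ins
sub asciify {
    my ($text) = @_;
    $text =~ tr/\x{2018}\x{2019}/'/;   # single smart quotes -> apostrophe
    $text =~ tr/\x{201C}\x{201D}/"/;   # double smart quotes -> straight quote
    $text =~ tr/\x{2013}\x{2014}/-/;   # en and em dashes -> hyphen
    $text =~ s/\x{2026}/.../g;         # ellipsis -> three dots
    return $text;
}

# Hand-written UTF-8 sample: smart quotes, an en dash, and an ellipsis
my $bytes = "\xE2\x80\x9CDon\xE2\x80\x99t\xE2\x80\x9D \xE2\x80\x93 ok\xE2\x80\xA6";
my $plain = asciify(decode('UTF-8', $bytes));
print "$plain\n";   # "Don't" - ok...
```

Anything not in the list passes through untouched, so a final check for characters above 0x7F is still worth doing if the output really has to be pure ASCII.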

Re: LWP::Simple returns strange encodings
by cormanaz (Deacon) on Jun 24, 2004 at 20:16 UTC
    The conversion can be done with the Text::Iconv module. The following worked on my machine (Win XP), though I found you have to be very careful to give the precise character set names in line 4 (the Text::Iconv constructor call).
    use strict;
    use Text::Iconv;
    my $unicodetext = <IN>; # or whatever
    my $utf2ascii = Text::Iconv->new('UTF-8', 'ASCII') or die "Can't make converter";
    my $asciitext = $utf2ascii->convert($unicodetext);
    print "$asciitext\n";