in reply to LWP::Simple returns strange encodings

See the content-type header:
> HEAD http://www.georgewbush.com/News/Read.aspx?ID=2768
200 OK
Cache-Control: private
Connection: close
Date: Thu, 24 Jun 2004 19:09:36 GMT
Server: Microsoft-IIS/5.0
Content-Length: 55012
Content-Type: text/html; charset=utf-8
Client-Date: Thu, 24 Jun 2004 19:14:04 GMT
Client-Peer: 65.172.163.121:80
Client-Response-Num: 1
Set-Cookie: ASP.NET_SessionId=kma4jj55oyrrn245ujknpdab; path=/
X-AspNet-Version: 1.1.4322
X-Powered-By: ASP.NET

It's UTF-8 encoded, and translating it back to ASCII might be lossy. See perlunicode and Encode::PerlIO for some pointers on how to handle Unicode. You will probably need a recent perl (at least 5.6, but preferably 5.8 or higher).
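
As a rough sketch of what handling it could look like on perl 5.8 with the Encode module: the byte string below is only a stand-in I made up for whatever LWP::Simple's get() hands back, not the real page content.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed stand-in for the fetched page: "It", U+2019 (right single
# quotation mark) and "s", as raw UTF-8 octets.
my $bytes      = "It\xE2\x80\x99s";
my $byte_count = length $bytes;

# Decode the octets into a Perl character string.
my $text       = decode('UTF-8', $bytes);
my $char_count = length $text;

print "$byte_count bytes, $char_count characters\n";

# Re-encode on output so the terminal receives UTF-8 again.
binmode(STDOUT, ':encoding(UTF-8)');
print "$text\n";
```

Once decoded, length, substr and regexes work on characters rather than on the individual bytes of a multi-byte sequence.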

Joost.

Re^2: LWP::Simple returns strange encodings
by iburrell (Chaplain) on Jun 24, 2004 at 19:43 UTC
    I am guessing the special characters are smart quotes and similar stuff. Those will definitely be lost if converted from UTF-8 to ASCII. He could translate to CP1252 instead, which has those characters in a non-standard location (the 0x80-0x9F range that ISO-8859-1 reserves for control characters). The characters probably originated in that encoding, in Microsoft products.
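
    A minimal sketch of that translation with Encode, assuming the input really is UTF-8; the sample string is invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample: "Hi" wrapped in U+201C / U+201D curly quotes, as UTF-8.
my $utf8_bytes = "\xE2\x80\x9CHi\xE2\x80\x9D";

# Decode to characters first, then encode to Windows code page 1252.
my $text   = decode('UTF-8', $utf8_bytes);
my $cp1252 = encode('cp1252', $text);

# Show the resulting bytes in hex.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $cp1252;
print "$hex\n";    # prints "93 48 69 94"
```

    Characters that have no CP1252 equivalent would still be lost, so this only helps if the text stays within that repertoire.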

    Alternatively, he could replace the characters with ASCII equivalents: smart quotes with plain quotes, en dashes with hyphens, and so on. This is similar to the conversion people do on CP1252 text, but matching on the Unicode characters instead.
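
    Something along these lines might do for the replacement approach. The mapping table and the asciify() name are my own illustrative choices, not a standard list, and the text is assumed to have been decoded to characters already.

```perl
use strict;
use warnings;

# Hypothetical mapping from common "smart" punctuation to ASCII.
my %ascii_for = (
    "\x{2018}" => "'",      # left single quotation mark
    "\x{2019}" => "'",      # right single quotation mark
    "\x{201C}" => '"',      # left double quotation mark
    "\x{201D}" => '"',      # right double quotation mark
    "\x{2013}" => '-',      # en dash
    "\x{2014}" => '--',     # em dash
    "\x{2026}" => '...',    # horizontal ellipsis
);

# Build one character class from the keys and substitute in a single pass.
my $class = join '', keys %ascii_for;

sub asciify {
    my ($text) = @_;
    $text =~ s/([$class])/$ascii_for{$1}/g;
    return $text;
}

my $result = asciify("\x{201C}Hi\x{201D} \x{2013} it\x{2019}s");
print "$result\n";    # prints "Hi" - it's
```

    Anything outside the table is left untouched, so other non-ASCII characters would still need separate handling.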