Unicode With LWP

by zer
Good evening, I am running into a weird bump with some older code of mine. The following code acts as a proxy for a website, changing minor things along the way. The problem is with the left and right double quotes that appear on certain websites: when the script prints them out, they don't come out the same.
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
binmode(STDOUT,":utf8"); #it doesnt print right with it here or not
my $t=qq(<img border="0px" src="" alt="T.png" />);
my $req = CGI->new();
use LWP::Simple;
my $content= get($req->param('go'));
$content=~s|<div id="jump-to-nav">.*?</div>||g;
$content=~s|<div class="printfooter".*|</body></html>|msg;
$content=~s|"(/tmpwiki/phase3)|"$1|g;
$content=~s|<head>.*</head>|<head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><title>BTT</title></head>|msg;
$content=~s|(\<img)|$1 border="0px" |msg;
$content=~s|BTT400:||;
$content=~s|$t.*?$t||msg;
if ($req->param("f")){
    print "Content-type: application/x-download\nContent-Disposition:attachment;filename=\"Business Requirement.doc\"\n\n".$content;
}else{
    print $req->header().$content;
}

Re: Unicode With LWP
by Juerd on Jan 20, 2008

    LWP::Simple gives you the content as a byte string, ignoring the charset parameter of the Content-Type header. If you want to pass the data along without decoding it, you have to declare the same charset that your source used, but LWP::Simple doesn't tell you what that was.

    You could find it out manually, hard code it, and hope they'll never change it. Or you could hop from LWP::Simple to a more advanced module, like full LWP. My favourite way of doing this is to use decoded_content and then explicitly re-encode as UTF-8 for output, because I like to standardize on UTF-8 for web stuff.
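
A minimal sketch of the decode-then-re-encode approach described above, assuming the full LWP distribution (LWP::UserAgent) and the core Encode module are installed. The helper names `decode_response`, `fetch_decoded`, and `to_utf8_bytes` are hypothetical and are not part of the original script or of LWP itself.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use Encode qw(encode);

# Hypothetical helper: turn an HTTP::Response into a string of Perl characters.
# decoded_content() uses the charset declared for the response (for example in
# the Content-Type header) and returns undef if the body cannot be decoded.
sub decode_response {
    my ($res) = @_;
    die "Fetch failed: " . $res->status_line . "\n" unless $res->is_success;
    my $text = $res->decoded_content;
    die "Could not decode response body\n" unless defined $text;
    return $text;
}

# Hypothetical helper: fetch a URL and return its decoded body.
sub fetch_decoded {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new(timeout => 10);
    return decode_response($ua->get($url));
}

# Hypothetical helper: re-encode characters as UTF-8 bytes for output.
sub to_utf8_bytes {
    my ($text) = @_;
    return encode('UTF-8', $text);
}
```

In the script from the question, this would mean replacing `get($req->param('go'))` with `fetch_decoded($req->param('go'))`, running the substitutions on the decoded text, and printing `to_utf8_bytes($content)` after the headers. With explicit encoding like this, STDOUT should be left as a plain byte stream: keeping `binmode(STDOUT,":utf8")` as well would encode the output twice.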

