LWP charset problem

roddik has asked for the wisdom of the Perl Monks concerning the following question:

Hi!

I'm using LWP::UserAgent to fetch some pages from the web, but sometimes the decoded_content method of HTTP::Message fails with the error 'utf8 "\x96" does not map to Unicode'. One such page is http://acus.org/new_atlanticist/sarkozy-delays-university-reforms-feared-greek-style-riots. It seems to me that this happens because some of the bytes on the page are not valid in the encoding that decoded_content selects. If I use decoded_content(charset => 'none') (which basically means leaving the content undecoded), the error disappears, but what I get is the message in an unknown encoding. A few malformed characters don't matter to me, so a possible workaround would be to get the body with decoded_content(charset => 'none'), then decode it myself using PERLQQ, and finally use a regex to delete the escaped bytes. But I don't want to copy the charset-guessing code from HTTP::Message. How would you suggest solving this? TIA

PS: not sure if everything is clear, just try to fetch the mentioned page and use decoded_content on it

Re: LWP charset problem by zentara (Cardinal) on Jan 05, 2009 at 17:28 UTC
Maybe LWP and UTF-8 will yield a clue?

I'm not really a human, but I play one on earth. Remember How Lucky You Are
Re^2: LWP charset problem by roddik (Initiate) on Jan 05, 2009 at 18:03 UTC
Thanks for the suggestion, but that thread deals with a slightly different problem. I decided to use a snippet like

    sub decoded_content {
        my $self = shift;
        # try the normal decoding first; raise_error makes a failure die
        my $c = eval { $self->SUPER::decoded_content(raise_error => 1) };
        return $c if defined $c;
        # otherwise take the undecoded body and decode it by hand
        $c = $self->SUPER::decoded_content(charset => 'none');
        $c = decode('utf8', $c, Encode::FB_PERLQQ());
        # PERLQQ turns each bad byte into a literal \xHH escape; drop those
        $c =~ s/\\x[0-9A-Fa-f]{2}//g;
        $c;
    }

to get what I want. Now the question is: how do I tell LWP to return a response object that uses this sub?
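One way to sidestep the subclassing question is to keep the fallback in a plain function that takes the response object, so LWP does not need to return anything special. The following is only a sketch under that assumption: `lenient_decoded_content` is a hypothetical name, not part of LWP, and it relies only on the `decoded_content` options already mentioned in this thread (`raise_error` and `charset => 'none'`).

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of LWP): accepts any object that offers
# decoded_content() the way HTTP::Response does, and returns text with
# bytes that are not valid UTF-8 removed.
sub lenient_decoded_content {
    my ($response) = @_;

    # Normal path first; raise_error => 1 turns a decoding failure into die.
    my $text = eval { $response->decoded_content(raise_error => 1) };
    return $text if defined $text;

    # Fallback: skip charset decoding and decode the bytes by hand.
    my $bytes = $response->decoded_content(charset => 'none');
    return unless defined $bytes;

    # FB_PERLQQ replaces each undecodable byte with a literal \xHH escape.
    $text = Encode::decode('UTF-8', $bytes, Encode::FB_PERLQQ());

    # Remove those escapes. Caveat: this also removes any text on the page
    # that happens to look like \xHH.
    $text =~ s/\\x[0-9A-Fa-f]{2}//g;
    return $text;
}
```

It would be called as `lenient_decoded_content($ua->get($url))`. The trade-off is that callers have to remember to use the function instead of the method.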
Re^3: LWP charset problem by zwon (Abbot) on Jan 05, 2009 at 18:58 UTC
I've checked your link (http://acus.org/new_atlanticist/sarkozy-delays-university-reforms-feared-greek-style-riots) and it really does contain a malformed UTF-8 sequence: a lone 0x96 byte. That byte cannot be decoded as UTF-8, which is why `decoded_content` fails. Also, you explicitly asked it to die by passing raise_error. Instead, try getting the raw content with `HTTP::Message::content` and decoding it yourself with `Encode::decode`.
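As a rough illustration of decoding the raw bytes by hand: in Windows-1252, byte 0x96 is an en dash, so one plausible guess is that the page is mostly UTF-8 with a few stray Windows-1252 bytes. The sketch below assumes exactly that, which may not hold for other pages. `decode_utf8_with_byte_fallback` is a hypothetical helper, and `cp1252` is an assumed fallback encoding.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode valid UTF-8 runs normally, and reinterpret
# each stray byte in a single-byte fallback encoding (assumed: cp1252).
sub decode_utf8_with_byte_fallback {
    my ($bytes, $fallback) = @_;
    $fallback ||= 'cp1252';

    my $text = '';
    while (length $bytes) {
        # FB_QUIET returns the part decoded so far and leaves the
        # unprocessed remainder in $bytes.
        $text .= Encode::decode('UTF-8', $bytes, Encode::FB_QUIET());
        last unless length $bytes;

        # The first remaining byte is not valid UTF-8: take it off the
        # front and decode it with the fallback encoding instead.
        $text .= Encode::decode($fallback, substr($bytes, 0, 1, ''));
    }
    return $text;
}

# "caf\xC3\xA9" is valid UTF-8; the lone \x96 is not.
my $text = decode_utf8_with_byte_fallback("caf\xC3\xA9 \x96 bar");
```

With a real response the input would be `$response->content`, or `$response->decoded_content(charset => 'none')` if the body may be compressed.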
Re^3: LWP charset problem by zentara (Cardinal) on Jan 05, 2009 at 18:20 UTC
I'm not sure what your code looks like, but LWP has a callback mechanism that is usually used for monitoring progress. You could possibly use it to decode your content. It is up to you to open a file and write the data as it comes in; possibly you could filter it there.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # don't buffer the prints, so the status line updates immediately
    $| = 1;

    my $ua = LWP::UserAgent->new();
    my $received_size = 0;
    my $url = 'http://www.cpan.org/authors/id/J/JG/JGOFF/parrot-0_0_7.tgz';
    print "Fetching $url\n";
    my $request_time = time;
    my $last_update  = 0;
    my $response = $ua->get($url,
        ':content_cb'     => \&callback,
        ':read_size_hint' => 8192,
    );
    print "\n";

    sub callback {
        my ($data, $response, $protocol) = @_;
        my $total_size = $response->header('Content-Length') || 0;
        $received_size += length $data;

        ############################################
        # Here you write the $data to a filehandle or whatever should happen
        # with it here, like do your decoding.
        ############################################

        my $time_now = time;
        # this makes the status only update once per second
        return unless $time_now > $last_update or $received_size == $total_size;
        $last_update = $time_now;
        print "\rReceived $received_size bytes";
        printf " (%i%%)", (100/$total_size)*$received_size if $total_size;
        printf " %6.1f/bps", $received_size/(($time_now-$request_time)||1)
            if $received_size;
    }
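One caveat if decoding is done inside such a callback: a multi-byte UTF-8 character can be split across two chunks, so decoding each chunk on its own may mangle it. A minimal sketch of one way around this is to collect the raw bytes in the callback and decode once at the end. `make_collector` is a hypothetical helper, and the chunks are simulated here so the example runs without a network connection.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns a chunk callback and a finisher that
# decodes everything collected so far.
sub make_collector {
    my $buffer = '';

    my $on_chunk = sub {
        my ($data) = @_;    # LWP also passes $response and $protocol
        $buffer .= $data;
        return;
    };

    my $finish = sub {
        # With no CHECK argument, Encode substitutes U+FFFD for each
        # malformed byte instead of dying.
        return Encode::decode('UTF-8', $buffer);
    };

    return ($on_chunk, $finish);
}

my ($on_chunk, $finish) = make_collector();

# Simulated chunks: the two bytes of "e acute" (C3 A9) arrive separately,
# followed by a stray 0x96 byte.
$on_chunk->($_) for "caf\xC3", "\xA9 \x96 ok";
my $text = $finish->();
```

With LWP the callback would be passed as `':content_cb' => $on_chunk`, and `$finish->()` would be called after the request returns. Buffering the whole body gives up the streaming benefit, so this only makes sense for pages that fit comfortably in memory.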