perl-diddler has asked for the wisdom of the Perl Monks concerning the following question:

Has anyone had experience differentiating the character encodings one needs to handle when saving pages returned by LWP::Curl?

I.e., let's say for simplicity's sake that I want to be able to specify a URL, have it fetched into memory, and then saved.

I'm running into a bit of a dilemma -- when I treat the contents as UTF-8, that works fine for the pages I'm fetching (which happen to use the XHTML standard's UTF-8), but it definitely doesn't work when I save binary files.

When I tried to fetch things as binary, that didn't work either, and I ended up with weird diamond-shaped marks where quotes should be (a 'feature' of UTF-8 being misinterpreted as Western/Latin-1).

Unfortunately, I can't tell the type from the file name, as some files are simply "site/get?item=xxx", where xxx could return text or an image.

I'm not at wit's end on this yet, but in trying to trim down some verbose output I hit another problem that I've already posted a question on here... so while waiting for some ideas on that, I thought I'd pick people's brains a bit rather than dash my head against documentation and various functions until I have a breakthrough or a headache and THEN end up here (i.e., with some insight, I might save myself some time!) :-)

thanks...

Replies are listed 'Best First'.
Re: LWP::Curl and character encoding
by Anonymous Monk on Nov 16, 2010 at 11:58 UTC
    Unfortunately, I can't tell the type from the file name, as some files are simply "site/get?item=xxx, where xxx could return text or an image.

    That is what headers are for. See HTTP::Response->decoded_content
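A minimal sketch of that approach using plain LWP::UserAgent (the URL is a hypothetical placeholder; the branch bodies are where your processing/saving would go):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $url = 'http://example.com/get?item=xxx';   # hypothetical URL

my $ua  = LWP::UserAgent->new;
my $res = $ua->get($url);
die 'fetch failed: ' . $res->status_line unless $res->is_success;

# The Content-Type header tells you what actually came back
if ( $res->content_is_text || $res->content_is_html ) {
    my $text = $res->decoded_content;   # charset-decoded Perl characters
    # ... process/save as text ...
}
else {
    my $bytes = $res->content;          # raw octets, safe for images etc.
    # ... save as binary ...
}
```

content_is_text and content_is_html come from the HTTP::Headers methods that HTTP::Response delegates to, so no hand-parsing of the Content-Type string is needed.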

      Oh...um...

I'm using LWP::Curl->get(...)...

      There's no response to parse...

      I take it that using 'get' is out....and I need to just not use the Curl interface?

      *sigh*...

        I take it that using 'get' is out....and I need to just not use the Curl interface?

        I don't see why that would be the case. Simply RTFM so you can get at the headers

      I see under HTTP::Response, the options to retrieve raw content or decoded content.

But there doesn't seem to be a $response->type, where type would return something to indicate whether I should decode the content (i.e., it's an HTML file) or get the raw content (i.e., it's binary and decoding it would mess it up).

If I call content, I get the raw bytes, but HTML isn't decoded from UTF-8; if I call decoded_content, the docs say: "Returns the content with any Content-Encoding undone and the raw content encoded to perl's Unicode strings."

Encoding a 'gif' file into UTF-8 will corrupt the file. So this doesn't seem to help; I seem to be stuck at the same place I was before -- needing to know the content of the buffer before looking at it! *lame!*...

This seems so basic -- there has to be a way to properly retrieve a URI and either process it or store it unprocessed; I'm just missing it. Since it seems like a very common problem, I can't imagine that everyone goes and starts hacking headers and trying to figure out what combination of 'use utf8/locale/bytes', etc., is necessary to make sense of this...
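Update: writing that out made me look at HTTP::Response again -- $response->content_type does return the media type from the headers, and content_is_text answers the text-vs-binary question directly (decoded_content also accepts charset => 'none' if you only want the Content-Encoding, e.g. gzip, undone without any character decoding). So something like this sketch might do (save_response and $path are hypothetical names; $res is an HTTP::Response):

```perl
use strict;
use warnings;

# Sketch: save an HTTP::Response body to $path, decoding only when it's text.
sub save_response {
    my ($res, $path) = @_;
    if ( $res->content_is_text || $res->content_is_html ) {
        # Text: decode per the response charset, re-encode as UTF-8 on write
        open my $fh, '>:encoding(UTF-8)', $path or die "open $path: $!";
        print {$fh} $res->decoded_content;
        close $fh or die "close $path: $!";
    }
    else {
        # Binary: write the octets untouched -- gifs stay gifs
        open my $fh, '>:raw', $path or die "open $path: $!";
        print {$fh} $res->content;
        close $fh or die "close $path: $!";
    }
}
```

The key point is that the decision comes from the headers, not from peeking at the buffer.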