Hi Monks;-)
After spending some night(mare)s with encoding madness, I finally got a nearly working solution for the following task:
- get English and German XML content via LWP::UserAgent from a page in UTF-8 or ISO-8859-1.
- parse the data with XML::Parser and do something stupid with it.
- save the data into a UTF-8 encoded database.
I set up the agent (package var $XAgent) like this:
# -----------------------------------------------------
sub _setAgent
{
    $XAgent = LWP::UserAgent->new(keep_alive => 1);
    $XAgent->default_header('Accept-Charset' => 'ISO-8859-1,utf-8');
    $XAgent->agent($ENV{'HTTP_USER_AGENT'});
    $XAgent->cookie_jar({}); # allow cookies
};
I get the data like this:
# -----------------------------------------------------
sub _getUrl
{
    # fetch the URL passed as first argument and return the decoded body
    return $XAgent->get(shift)->decoded_content();
};
I ask XML::Parser to parse the data like this:
# -----------------------------------------------------
my $xml = _getUrl($url);
my $p = XML::Parser->new(Style => 'Stream', Pkg => 'some_pkg',
                         ProtocolEncoding => "utf-8");
$p->parse($xml);
If I leave out the 'utf-8' hint for XML::Parser, some non-ASCII chars get screwed up, so I thought
"alright, decoded_content() returns perl-friendly utf-8, so set that manually!" Works almost every time.
Almost. So I'm not quite sure if I'm right about that assumption.
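The assumption above mixes two things that are easy to confuse: a string of Perl characters and a string of UTF-8 encoded bytes. As far as the HTTP::Message documentation describes it, decoded_content() aims to return the former for textual content. The sketch below uses only the core Encode module to show the difference; the sample word and the variable names are made up for illustration and are not part of the original code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: "Gruesse" with u-umlaut and sharp s, written as the
# raw ISO-8859-1 bytes a server might send (0xFC and 0xDF).
my $latin1_bytes = "Gr\xFC\xDFe";

# Decoding turns bytes into Perl characters (one element per character).
my $chars = decode('ISO-8859-1', $latin1_bytes);

# Encoding turns characters into UTF-8 bytes for a byte-oriented consumer.
my $utf8_bytes = encode('UTF-8', $chars);

printf "characters: %d, UTF-8 bytes: %d\n",
    length($chars), length($utf8_bytes);
```

If the parser is told ProtocolEncoding => "utf-8", handing it something like $utf8_bytes (explicitly encoded) rather than relying on the internal representation of a character string may be the more predictable choice; that is an assumption worth testing against the real feed.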
So my questions are:
- IS the return of decoded_content() utf-8?
if not: how do I force it to return utf-8 even if the server sends me ISO-8859-1?
- And if I do something like return decoded_content AND the charset from _getUrl and pass the charset as ProtocolEncoding into XML::Parser: do I still end up with ISO-8859-1 encoded strings inside my some_pkg handlers?
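One thing that might help when chasing the "almost" cases is comparing the encoding the document itself declares with the one passed as ProtocolEncoding. The following is only a diagnostic sketch: declared_encoding() is a hypothetical helper, not part of LWP or XML::Parser, and the sample document is invented.

```perl
use strict;
use warnings;

# Hypothetical helper: return the encoding named in the XML declaration,
# or undef if the document has no declaration or no encoding attribute.
sub declared_encoding {
    my ($xml) = @_;
    return $xml =~ /\A\s*<\?xml[^>]*?\bencoding\s*=\s*(["'])(.*?)\1/
        ? $2
        : undef;
}

# Invented sample documents, one with and one without a declared encoding.
my $with    = qq{<?xml version="1.0" encoding="ISO-8859-1"?><doc/>};
my $without = qq{<?xml version="1.0"?><doc/>};

print declared_encoding($with)    // 'none', "\n";
print declared_encoding($without) // 'none', "\n";
```

Logging this value next to the charset reported by the response for each failing URL could show whether the problem pages are the ones where the two disagree.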
Any further enlightenment and, of course, hints on how to do things better/faster are highly welcome.
TIA, ~.rhavin;)