alexg has asked for the wisdom of the Perl Monks concerning the following question:
The XML::Parser error message generated is:
"not well-formed (invalid token) at line 15, column 19, byte 530"
When I look at byte 530, it turns out to be the 'é' in Nescafé. Other exotic (non-ASCII) characters also cause XML::Parser to stop dead. I've tried the nice_string function from the Unicode man page:
    sub nice_string {
        join("",
          map { $_ > 255 ?                  # if wide character...
                sprintf("\\x{%04X}", $_) :  # \x{...}
                chr($_) =~ /[[:cntrl:]]/ ?  # if control character
                sprintf("\\x%02X", $_) :    # \x..
                chr($_)                     # else as themselves
          } unpack("U*", $_[0]));           # unpack Unicode
    }
but this enrages XML::Parser even further, and it fails at the first end-of-line character. I'm using LWP::Simple to grab the XML, so my script essentially looks like this:
    my $rss = XML::RSS->new;
    eval { $rss->parse(nice_string(get($url))); };
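A small check of what nice_string does to a newline may explain the failure at the first end-of-line character. This is only a sketch: the two-line document in $xml is a made-up stand-in for whatever get($url) returns.

```perl
use strict;
use warnings;

# nice_string as given in the Perl Unicode documentation.
sub nice_string {
    join("",
        map { $_ > 255 ? sprintf("\\x{%04X}", $_)
            : chr($_) =~ /[[:cntrl:]]/ ? sprintf("\\x%02X", $_)
            : chr($_)
        } unpack("U*", $_[0]));
}

# Hypothetical two-line document standing in for the fetched feed.
my $xml     = qq{<?xml version="1.0"?>\n<rss/>};
my $escaped = nice_string($xml);

# The newline is now the four literal characters \x0A.
print "$escaped\n";
```

The newline is a control character, so it is replaced by the literal text \x0A. That text now sits between the XML declaration and the root element, where character data is not allowed, so the parser rejects the document before it ever reaches the 'é'. nice_string is meant for displaying strings, not for preparing parser input.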
Can anyone recommend a module/function that will reliably sanitise the string that get() returns, in an encoding suitable for XML::Parser?
PS I've looked at the 'encoding' option for XML::Parser but it doesn't seem to change the result :(
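One possible approach, sketched here under assumptions the thread does not confirm, is to transcode the fetched bytes to UTF-8 and give them a matching XML declaration before parsing. The sketch assumes that get() returns raw bytes and that any input which is not valid UTF-8 is ISO-8859-1 (the 'é' failure is consistent with a Latin-1 byte in a document the parser takes to be UTF-8). The helper name xml_to_utf8 is hypothetical; Encode ships with Perl 5.8 and later.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return the document as UTF-8 bytes with a matching
# XML declaration. Assumption: non-UTF-8 input is ISO-8859-1.
sub xml_to_utf8 {
    my ($bytes) = @_;

    # Try strict UTF-8 first. With FB_CROAK, decode() dies on malformed
    # input and may modify its argument, so work on a copy.
    my $copy = $bytes;
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };

    # Fall back to Latin-1, which maps every byte to a character.
    $text = decode('ISO-8859-1', $bytes) unless defined $text;

    # Drop any existing XML declaration (and a leading BOM), since the
    # encoding it names may not match the data.
    $text =~ s/\A\x{FEFF}?\s*<\?xml[^>]*\?>\s*//;

    return encode('UTF-8',
        qq{<?xml version="1.0" encoding="UTF-8"?>\n} . $text);
}

# Usage with a made-up Latin-1 sample; \xE9 is 'é' in ISO-8859-1.
my $latin1 = qq{<?xml version="1.0"?>\n<item>Nescaf\xE9</item>};
my $clean  = xml_to_utf8($latin1);
```

In the script above, this would be called as $rss->parse(xml_to_utf8(get($url))) in place of nice_string. If the feed is in some other encoding, such as Windows-1252, the fallback will mis-map a few characters, so the Latin-1 guess should be treated as a starting point rather than a general fix.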
Replies are listed 'Best First'.
•Re: XML::RSS
by merlyn (Sage) on Mar 21, 2003 at 11:36 UTC
by alexg (Beadle) on Mar 21, 2003 at 12:12 UTC
by rjray (Chaplain) on Mar 21, 2003 at 23:26 UTC
Re: XML::RSS
by zby (Vicar) on Mar 21, 2003 at 11:29 UTC
by alexg (Beadle) on Mar 21, 2003 at 11:53 UTC
by zby (Vicar) on Mar 21, 2003 at 12:05 UTC
Re: XML::RSS
by AnthonyLewis (Novice) on Mar 23, 2003 at 07:40 UTC
Re: XML::RSS
by ajt (Prior) on Mar 24, 2003 at 21:40 UTC