tomhukins has asked for the wisdom of the Perl Monks concerning the following question:
I discovered that several sites I'm interested in sometimes output badly formed RSS. Special characters are often unescaped, which causes problems for XML::RSS and any other strict XML parser.
I've written a simple heuristic to parse badly formed XML, which might be of use to others. Maybe you can help me improve this?
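    use strict;
    use warnings;
    use XML::RSS ();

    my $rss_content = "RSS goes here";

    # Repairs to try in order, mildest first: [pattern, replacement]
    my @replace = (
        [ qr/&(?=\s)/,  '&amp;' ],   # '&' followed by whitespace
        [ qr/(?<=\s)>/, '&gt;'  ],   # '>' preceded by whitespace
        [ qr/&(?!(?:#\d+|#x[0-9A-Fa-f]+|\w+);)/, '&amp;' ],   # any other '&' that does not start an entity
    );

    my ($rss, $ok);
    PARSE: {
        $rss = XML::RSS->new;
        $ok  = eval { $rss->parse($rss_content); 1 };
        last PARSE if $ok;
        # Parse failed: apply the next repair and retry, or give up
        my $repl = shift @replace or last PARSE;
        $rss_content =~ s/$repl->[0]/$repl->[1]/g;
        redo PARSE;
    }
    die "Could not parse RSS: $@" unless $ok;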
I've only used this with XML::RSS, but this technique could be applied to XML::Parser, or any other XML parsing code.
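To illustrate how the same retry-and-repair loop could be wrapped around any parser, here is a minimal sketch. The helper name `parse_with_repairs` and the stand-in `toy_parse` are hypothetical and not part of XML::RSS or XML::Parser; the sketch assumes the parser callback dies on malformed input and returns a defined value on success.

```perl
use strict;
use warnings;

# Hypothetical helper: call $parse->($xml); on failure apply the next
# repair rule and try again. Dies once every rule has been used up.
sub parse_with_repairs {
    my ($xml, $parse, @rules) = @_;
    while (1) {
        my $result = eval { $parse->($xml) };
        return $result if defined $result;
        my $rule = shift @rules
            or die "Unparseable even after all repairs: $@";
        my ($pattern, $replacement) = @$rule;
        $xml =~ s/$pattern/$replacement/g;
    }
}

# Repairs in order, mildest first: [pattern, replacement]
my @rules = (
    [ qr/&(?=\s)/,  '&amp;' ],    # '&' followed by whitespace
    [ qr/(?<=\s)>/, '&gt;'  ],    # '>' preceded by whitespace
    [ qr/&(?!(?:#\d+|#x[0-9A-Fa-f]+|\w+);)/, '&amp;' ],  # other stray '&'
);

# Stand-in for a strict parser (an assumption for this sketch): it rejects
# any '&' that does not start an entity reference, otherwise returns the text.
sub toy_parse {
    my ($xml) = @_;
    die "stray ampersand\n" if $xml =~ /&(?!(?:#\d+|#x[0-9A-Fa-f]+|\w+);)/;
    return $xml;
}

my $fixed = parse_with_repairs('<title>Q&A &amp; more</title>', \&toy_parse, @rules);
print "$fixed\n";
```

With a real parser, the callback would be something like `sub { XML::Parser->new->parse($_[0]); 1 }`, where the trailing `1` gives the helper a defined value to return on success.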
(jcwren) Re: Parsing badly formed RSS or XML
by jcwren (Prior) on Feb 22, 2001 at 21:11 UTC

Re: Parsing badly formed RSS or XML
by davorg (Chancellor) on Feb 22, 2001 at 21:22 UTC
by tomhukins (Curate) on Feb 22, 2001 at 21:36 UTC
by davorg (Chancellor) on Feb 22, 2001 at 21:43 UTC
by tomhukins (Curate) on Feb 22, 2001 at 22:08 UTC

Re: Parsing badly formed RSS or XML
by mirod (Canon) on Feb 22, 2001 at 21:30 UTC