qq has asked for the wisdom of the Perl Monks concerning the following question:
OT, but I will solve it with perl if I solve it at all.
Is there any way to make sense of an RSS file that's been incorrectly encoded? The following line appears in this feed:
<title>BCCI confirms India A&Acirc;Â’s Zimbabwe tour</title>
Which should read: "India A's Zimbabwe tour"
This being RSS with no encoding specified, it's officially UTF-8. But it has obviously been HTML-encoded at some point.
Can anybody explain to me how to figure out what happened to the string above? Can it be undone?
Or, as is my current inclination, should I just exclude any feeds/items that I find HTML entities in? I have a lot of feeds to parse, so I need a solution that doesn't require examining each feed by hand.
thanks again, qq
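The two options raised in the question (try to undo the damage, or just flag and skip damaged items) can be sketched with core Perl only. This is a minimal sketch, not a definitive fix: `looks_mangled` and `try_unmangle` are hypothetical helper names, and the repair step assumes the text is UTF-8 whose bytes were misread as Latin-1 at some point, which the post does not confirm for this feed.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK);

# Hypothetical helper: true if the text shows signs of upstream mangling.
sub looks_mangled {
    my ($text) = @_;
    # XML predefines only amp, lt, gt, quot and apos. Any other named
    # entity in a DTD-less RSS file suggests HTML-encoding upstream.
    return 1 if $text =~ /&(?!(?:amp|lt|gt|quot|apos);)[A-Za-z][A-Za-z0-9]*;/;
    # A UTF-8 lead byte read as a Latin-1 character (U+00C2..U+00F4)
    # followed by a continuation byte (U+0080..U+00BF) is typical mojibake.
    return 1 if $text =~ /[\x{C2}-\x{F4}][\x{80}-\x{BF}]/;
    return 0;
}

# Hypothetical helper: undo one round of "UTF-8 read as Latin-1".
# Returns the repaired string, or undef if the assumption does not hold.
sub try_unmangle {
    my ($text) = @_;
    # Turn the characters back into the bytes they were misread from;
    # this dies (and we give up) if a character is above U+00FF.
    my $bytes = eval { encode('iso-8859-1', $text, FB_CROAK) };
    return undef unless defined $bytes;
    # Strictly decode those bytes as UTF-8; invalid sequences mean the
    # text was not mojibake of this kind, so undef is returned.
    my $fixed = eval { decode('UTF-8', $bytes, FB_CROAK) };
    return $fixed;
}

# Usage: keep clean titles, repair what can be repaired, skip the rest.
for my $title ("Tom &amp; Jerry", "caf\x{C3}\x{A9}", "A&Acirc;s") {
    if (!looks_mangled($title)) {
        print "ok\n";
    }
    elsif ($title !~ /&/ && defined try_unmangle($title)) {
        print "repaired\n";
    }
    else {
        print "skipped\n";
    }
}
```

The entity layer would normally be undone first, for example with `decode_entities` from HTML::Entities (a CPAN module, left out here to keep the sketch core-only). If a repaired string still contains characters in the range U+0080 to U+009F, that hints the original bytes were Windows-1252 mislabelled as Latin-1, where byte 0x92 is the curly apostrophe.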
|
Replies are listed 'Best First'.
|
Re: Unmangle RSS encodings
by hardburn (Abbot) on Jun 30, 2004 at 20:49 UTC | |
by qq (Hermit) on Jun 30, 2004 at 21:35 UTC | |
|
Re: Unmangle RSS encodings
by pbeckingham (Parson) on Jun 30, 2004 at 20:46 UTC | |
|
Re: Unmangle RSS encodings
by theorbtwo (Prior) on Jul 01, 2004 at 04:28 UTC | |
by qq (Hermit) on Jul 01, 2004 at 05:00 UTC | |
|
Re: Unmangle RSS encodings
by iburrell (Chaplain) on Jul 01, 2004 at 16:40 UTC |