Unmangle RSS encodings

qq has asked for the wisdom of the Perl Monks concerning the following question:

OT, but I will solve it with perl if I solve it at all.

Is there any way to make sense of an RSS file that's been incorrectly encoded? The following line appears in this feed.

    <title>BCCI confirms India A&amp;Acirc;Ā’s Zimbabwe tour</title>

Which should read: "India A's Zimbabwe tour"

This being RSS with no encoding specified, it's officially UTF-8. But it's obviously been HTML-encoded at some point.

Can anybody explain to me how to figure out what happened to the string above? Can it be undone?

Or, as is my current inclination, should I just exclude any feeds/items that I find HTML entities in? I have a lot of feeds to parse, so I need a solution that doesn't require examining each feed by hand.

thanks again, qq


Replies are listed 'Best First'.
Re: Unmangle RSS encodings by hardburn (Abbot) on Jun 30, 2004 at 20:49 UTC
If the RSS server isn't holding up its end of the bargain to send UTF-8, then you're probably screwed. There is a guarantee of functionality, and it wasn't upheld. I doubt any automatic solution will do the job.

----
send money to your kernel via the boot loader.. This and more wisdom available from Markov Hardburn.
Re^2: Unmangle RSS encodings by qq (Hermit) on Jun 30, 2004 at 21:35 UTC
Thanks to you and pbeckingham. I'll test for the presence of HTML entities with a regex and trash the item if I find any.
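
A minimal sketch of the "regex and trash" filter described above, written in Python for illustration rather than Perl. The names `ENTITY_RE`, `looks_html_encoded`, and `keep_clean_items` are hypothetical, and the sketch assumes the titles have already been through the XML parser, so any entity reference still present is a leftover from an earlier HTML-encoding pass.

```python
import re

# Matches named (&Acirc;), decimal (&#194;) and hex (&#xC2;) entity references.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def looks_html_encoded(text: str) -> bool:
    """Return True if XML-decoded text still contains an entity reference."""
    return ENTITY_RE.search(text) is not None


def keep_clean_items(titles):
    """Return only the titles that contain no leftover entity references."""
    return [t for t in titles if not looks_html_encoded(t)]
```

A bare ampersand followed by a space, as in "Tom & Jerry; the sequel", is not flagged, because the pattern requires a name or number immediately after the `&`.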
Re: Unmangle RSS encodings by pbeckingham (Parson) on Jun 30, 2004 at 20:46 UTC
I think you're sunk. Too many trips through the mangle. However, given:

    <title>BCCI confirms India A&Acirc;Ā’s Zimbabwe tour</title>

With some bogus heuristics, you could probably clean it up somewhat to this:

    <title>BCCI confirms India A’s Zimbabwe tour</title>

But that's not a good idea, and not correct, either.
Re: Unmangle RSS encodings by theorbtwo (Prior) on Jul 01, 2004 at 04:28 UTC
I'd tend to say that you can do something quite simple: if your XML parser refuses to parse it, throw it out. That's how XML is supposed to work.

OTOH, a quick glance at that URL in Mozilla does not give a parse error, so it must not be invalid XML. (Note that it being invalid UTF-8 would automatically make it invalid XML.)

At this point, you have two choices, both valid, on what to do about it. You can either display it just like it says to display it: `A&Acirc;Ā’s` and all -- which is what the feed tells you to do -- or you can try to make it look more or less like what it's supposed to be.

If you want to do the latter, try this: your XML parser should already decode the "outer" (XML) layer of the double-encoding, and give you `A&Acirc;Ā’s`. Run a simple `m/&\w+;/` against it. If that matches, it's probably been HTML-encoded before being XML-encoded. Run it through HTML::Entities to HTML-decode it.

Note that this will get it down to "AÂĀ’s", which may or may not have been what they originally intended (I suspect they meant to say A&Â's, and the extra Ā came from somewhere completely random). It is, however, as close as you can reasonably get.

Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).
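
The steps above (parse the XML, test for a leftover entity, HTML-decode) can be sketched roughly as follows. This is a Python illustration, not the Perl the post has in mind; in Perl the last step would be HTML::Entities as the post says. The function name `decode_title` and the simplified `<item>` sample are assumptions made for the example, and the sample drops the stray non-ASCII characters from the real feed line.

```python
import html
import re
import xml.etree.ElementTree as ET

# Same test as the m/&\w+;/ suggested in the post.
ENTITY_RE = re.compile(r"&\w+;")


def decode_title(item_xml: str) -> str:
    """Parse one <item>, then HTML-decode its title if an entity survived."""
    # The XML parser removes the outer layer: &amp;Acirc; becomes &Acirc;
    title = ET.fromstring(item_xml).findtext("title", default="")
    # A surviving entity suggests the text was HTML-encoded before XML-encoding.
    if ENTITY_RE.search(title):
        title = html.unescape(title)
    return title


# Simplified stand-in for the feed line quoted in the question.
sample = "<item><title>BCCI confirms India A&amp;Acirc;s Zimbabwe tour</title></item>"
cleaned = decode_title(sample)
```

As the post warns, this only removes the entity layers. The result still contains the stray "Â", so it is closer to the original but not a repair.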
Re^2: Unmangle RSS encodings by qq (Hermit) on Jul 01, 2004 at 05:00 UTC
I'm going to throw it out. It _is_ valid XML, and it is _valid_ UTF-8, but it's still garbage. The final result is supposed to be, simply, "A's" (I checked the link the RSS item points to). I don't think the extra Â is exactly random, however, and I'm sure if I really understood Unicode and encodings I could at least make a reasonable guess at what has been done to that poor string. One day...

qq
Re: Unmangle RSS encodings by iburrell (Chaplain) on Jul 01, 2004 at 16:40 UTC
The text has been encoded too many times, with UTF-8 characters treated as individual bytes, translated to HTML entities, and then to XML entities (unless Perl Monks added the `&amp;`). It might be possible to reverse the mangling. The hard part is figuring out what transformations to apply, because presumably the rest of the file does not have the problem.

Personally, I say follow the standards and decode the file based on the encoding in the XML header, and the XML entities. The corruption is the provider's fault. If everyone follows the rules, then corruption becomes less likely.
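
To illustrate what reversing such a chain could look like, here is a hedged Python sketch. It assumes one specific, hypothetical history: UTF-8 bytes misread as Windows-1252, then HTML-entity-encoded, then XML-escaped. That is not known to be what happened to the feed in question, whose `&Acirc;Ā’` sequence does not fit this pattern exactly. The function name `try_unmangle` and the sample string are made up for the example.

```python
import html


def try_unmangle(text: str, max_entity_layers: int = 3):
    """Undo entity layers, then undo a UTF-8-read-as-cp1252 misdecoding.

    Returns the repaired string, or None if the byte-level assumption
    does not hold for this input.
    """
    # Peel entity layers (XML, then HTML) until the text stops changing.
    for _ in range(max_entity_layers):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    try:
        # Assumption: each character stands for one misread UTF-8 byte.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


# Hypothetical mangled form of "India A’s tour" under the assumed history.
mangled = "India A&amp;acirc;&amp;euro;&amp;trade;s tour"
repaired = try_unmangle(mangled)
```

A title that was correct to begin with, such as one containing a real "’", fails the byte-level step and yields None, which is one reason a blanket repair is risky: the function cannot tell which items in a feed need it.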