Don't look for one general module that will solve all your HTML-to-data problems. Look at the page or pages that you want to extract data from, and figure out which modules are best for those particular cases. In my experience (which is less than that of most others here), it's not worth the trouble to find something that will go straight from HTML to appropriately structured XML. Whoever generated the page had some database model and spewed it into some template that they invented, probably with no thought whatsoever of making it easy to turn back into data. Or they didn't even do things in a consistent way, which makes inverting it even harder.
If you have access to a lot of O'Reilly stuff, don't look at the general books. Look at a practical one--I started HTML scraping with recipes out of Spidering Hacks and still refer back to it occasionally.
Here's a recent example (after the more tag): I had a bunch of pages on a website, and I wanted to copy the book metadata from all of them into XML so I could generate a catalog from that XML. The catch is that the pages were all hand-coded. They did a pretty good job of using CSS to identify the relevant parts, but there were still inconsistencies, and a few of the older pages were so out of whack that they didn't get processed at all.
If you look at the code, it's pretty specific to the pages I was scraping, so it's ugly in all sorts of ways. It could also be made somewhat simpler if I needed to do it a bunch more times: it's a bit repetitive in pulling out the labeled items, so that part could be a loop through an array of names, maybe with flags in the array for items that need special treatment. There are also extraneous modules loaded. The original pages were inconsistent about odd characters and entities, and that was one of the bigger headaches. Note how I find the pieces I want: I know how they're named, so I just do a "look down" to find them, and then process the contents from there. Note also that I use XML::Writer to generate the XML, rather than trying to do it myself.
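For anyone who wants the shape of that approach without reading the whole script, here is a minimal sketch. It is not my original code: it is a Python stand-in that uses only the standard library, with html.parser in place of the "look down" call and xml.etree.ElementTree in place of XML::Writer. The class names (book-title, book-author, book-isbn), the strip_hyphens flag, and the sample page are all made up for illustration. It also assumes every non-void tag is closed, which real hand-coded pages often violate.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical field table: (CSS class on the page, XML element name, flag).
# The flag marks fields that need special treatment.
FIELDS = [
    ("book-title", "title", None),
    ("book-author", "author", None),
    ("book-isbn", "isbn", "strip_hyphens"),
]

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextCollector(HTMLParser):
    """Collect the text of the first element carrying each wanted class."""

    def __init__(self, wanted):
        # convert_charrefs=True turns entities such as &eacute; into text.
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.found = {}        # class name -> whitespace-normalized text
        self._current = None   # class currently being collected, if any
        self._depth = 0        # nesting depth inside the collected element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1
            return
        for name in (dict(attrs).get("class") or "").split():
            if name in self.wanted and name not in self.found:
                self._current, self._depth, self._chunks = name, 1, []
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            text = " ".join("".join(self._chunks).split())
            self.found[self._current] = text
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)


def scrape_book(html):
    """Return a <book> element built from one page of HTML."""
    parser = ClassTextCollector(css for css, _, _ in FIELDS)
    parser.feed(html)
    parser.close()
    book = ET.Element("book")
    for css, xml_name, flag in FIELDS:
        text = parser.found.get(css)
        if text is None:
            continue  # inconsistent page: this field is simply missing
        if flag == "strip_hyphens":
            text = text.replace("-", "")
        # ElementTree escapes &, < and > when serializing.
        ET.SubElement(book, xml_name).text = text
    return book


# Made-up sample page, including an entity and nested markup.
SAMPLE = """
<div class="book">
  <h2 class="book-title">Caf&eacute; Society &amp; Other Stories</h2>
  <p class="book-author">A. N. <b>Author</b></p>
  <p class="book-isbn">0-000-00000-0</p>
</div>
"""

if __name__ == "__main__":
    print(ET.tostring(scrape_book(SAMPLE), encoding="unicode"))
```

The field table is the "loop through an array of names" idea: adding another labeled item means adding one row, not another copy of the extraction code. Letting a library serialize the XML is the same point as using XML::Writer, since escaping is handled for you.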