in reply to extracting data from HTML

  1. open it? Perl's built-in open, if it's a local file.
  2. retrieve? WWW::Mechanize or LWP::UserAgent (see the sketch after this list), or threads here that satisfy search terms like "fetch" and "html" (or you could open your source in a browser and 'save as')
  3. get nice XML? You may want to understand XML before raising this question... See w3.org/TR/rec-xml if you don't have a pretty good handle on what "eXtensible Markup Language" is... and search out an XML parser here, if you do. See also nodes here satisfying a search term like "parse."
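
For item 2, a minimal fetch sketch (the URL is a placeholder):

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua  = LWP::UserAgent->new( timeout => 10 );
    my $res = $ua->get('http://example.com/page.html');   # placeholder URL
    die "fetch failed: " . $res->status_line unless $res->is_success;
    my $html = $res->decoded_content;                     # the page's HTML

WWW::Mechanize works much the same way ($mech->get($url), then $mech->content) and adds link- and form-handling on top.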

Re^2: extracting data from HTML
by Jurassic Monk (Acolyte) on Jun 03, 2012 at 12:31 UTC

    Hi,

    the point is to get it into something I can handle with XPath and do some 'foreach' if needed
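
    For instance, something of this shape would do (a rough sketch; HTML::TreeBuilder::XPath is one module that runs XPath straight over plain HTML, and the class name here is just an example):

        use strict;
        use warnings;
        use HTML::TreeBuilder::XPath;

        my $html = do { local $/; <> };   # slurp the HTML from a file or STDIN
        my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

        # XPath over the parsed HTML, with a foreach over the hits
        foreach my $row ( $tree->findnodes('//div[@class="entry"]') ) {
            print $row->as_text, "\n";
        }
        $tree->delete;   # HTML::TreeBuilder trees want explicit cleanup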

    and yes, I've read most of O'Reilly:

    • XML
    • Perl & XML
    • XML Schema
    • XSLT
    • XSLT cookbook

    And that is the reason I turn to the monastery, for the answers are not to be found in those scrolls.

      Don't look for one general module that will solve all your HTML-to-data problems. Look at the page or pages that you want to extract data from, and figure out what the best modules are for those particular cases. In my experience (which is less than most others' here), it's not worth the trouble to find something that will go straight from HTML to appropriately structured XML. Whoever generated the page had some database model and spewed it into some template that they invented, probably with no thought whatsoever toward making it easy to turn back into data. Or they didn't even do things in a consistent way, making your problem of inverting it even worse.

      If you have access to a lot of O'Reilly stuff, don't look at the general books. Look at a practical one--I started HTML scraping with recipes out of Spidering Hacks and still refer back to it occasionally.

      Here's a recent example (after the more tag): I had a bunch of pages on a website, and I wanted to copy the book metadata from all of them into XML so I could generate a catalog from the XML. The catch is that the pages were all hand coded. They did a pretty good job of using CSS to identify the relevant parts, but there were still inconsistencies, and a few of the older pages were so out of whack that they didn't get processed at all.

      If you look at the code, it's pretty specific to the pages I was scraping, so it's ugly in all sorts of ways. It could also be made somewhat simpler if I needed to do it a bunch more times--it's a bit repetitive in pulling out a bunch of the labeled items, so those could be a loop through an array of names, with perhaps some flags in the array for special treatment. There are also extraneous modules called--the original pages were inconsistent about odd characters and entities, and that was one of the bigger headaches. Note how I find the pieces I want--I know how they're named, so I just do a look_down to find them and then process the contents from there. Note also that I use XML::Writer to generate the XML, rather than trying to do it myself.
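
      Not the original code, just a stripped-down sketch of that shape--the file name and the class/field names here are invented:

          use strict;
          use warnings;
          use HTML::TreeBuilder;
          use XML::Writer;
          use IO::File;

          my $tree = HTML::TreeBuilder->new_from_file('book-page.html');   # invented name

          my $out    = IO::File->new('> catalog.xml');
          my $writer = XML::Writer->new( OUTPUT => $out, DATA_MODE => 1, DATA_INDENT => 2 );

          $writer->startTag('book');
          # the pages label each field with a CSS class, so look_down by class
          for my $field (qw(title author isbn)) {        # invented field names
              my $el = $tree->look_down( class => $field );
              $writer->dataElement( $field, $el ? $el->as_trimmed_text : '' );
          }
          $writer->endTag('book');
          $writer->end;
          $out->close;
          $tree->delete;

      The real thing loops over many pages and deals with the entity mess, but the look_down-then-XML::Writer shape is the same.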

Re^2: extracting data from HTML
by Jurassic Monk (Acolyte) on Jun 03, 2012 at 12:57 UTC

    nice XML...

    I probably should have said "well-formed", without even bothering with "valid" XML, for most websites don't produce XHTML, which makes it troublesome to just read in the source and get back an XML object; hence my question.
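
    What I'm really after is something of this shape (a sketch, assuming XML::LibXML's recovering HTML parser fits the bill):

        use strict;
        use warnings;
        use XML::LibXML;

        my $html = do { local $/; <> };   # slurp the non-XHTML source

        # the recovering HTML parser coaxes tag soup into a well-formed DOM
        my $dom = XML::LibXML->load_html(
            string  => $html,
            recover => 2,                 # 2 = repair silently, 1 = warn while repairing
        );

        # plain XPath now works, however sloppy the original markup was
        for my $href ( $dom->findnodes('//a/@href') ) {
            print $href->value, "\n";
        }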