in reply to extracting data from HTML

  1. open it? Perl's built-in open, if it's a local file.
  2. retrieve? WWW::Mechanize or LWP::UserAgent (see the sketch after this list), or threads here that satisfy search terms like "fetch" and "html" (or you could open your source in a browser and 'save as')
  3. get nice XML? You may want to understand XML before raising this question... See w3.org/TR/rec-xml if you don't have a pretty good handle on what "eXtensible Markup Language" is... and search out an XML parser here, if you do. See also nodes here satisfying a search term like "parse."
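
For item 2, a minimal fetch sketch (the URL is a placeholder):

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua  = LWP::UserAgent->new( timeout => 10 );
    my $res = $ua->get('http://example.com/page.html');   # placeholder URL
    die "fetch failed: " . $res->status_line unless $res->is_success;
    my $html = $res->decoded_content;                     # the page's HTML

WWW::Mechanize works much the same way ($mech->get($url), then $mech->content) and adds link- and form-handling on top.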

Re^2: extracting data from HTML
by Jurassic Monk (Acolyte) on Jun 03, 2012 at 12:31 UTC

    Hi,

    the point is to get it into something I can handle with XPath and do some 'foreach' if needed
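
    For instance, something of this shape would do (a rough sketch; HTML::TreeBuilder::XPath is one module that runs XPath straight over plain HTML, and the class name here is just an example):

        use strict;
        use warnings;
        use HTML::TreeBuilder::XPath;

        my $html = do { local $/; <> };   # slurp the HTML from a file or STDIN
        my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

        # XPath over the parsed HTML, with a foreach over the hits
        foreach my $row ( $tree->findnodes('//div[@class="entry"]') ) {
            print $row->as_text, "\n";
        }
        $tree->delete;   # HTML::TreeBuilder trees want explicit cleanup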

    and yes, I've read most of O'Reilly:

    • XML
    • Perl & XML
    • XML Schema
    • XSLT
    • XSLT cookbook

    And that is the reason I turn to the monastery, for the answers are not to be found in those scrolls.

      Don't look for one general module that will solve all your HTML-to-data problems. Look at the page or pages that you want to extract data from, and figure out what the best modules are for those particular cases. In my experience (which is less than most others' here), it's not worth the trouble to find something that will go straight from HTML to appropriately structured XML. Whoever generated the page had some database model and spewed it into some template that they invented, probably with no thought whatsoever toward making it easy to turn back into data. Or they didn't even do things in a consistent way, making your problem of inverting it even worse.

      If you have access to a lot of O'Reilly stuff, don't look at the general books. Look at a practical one--I started HTML scraping with recipes out of Spidering Hacks and still refer back to it occasionally.

      Here's a recent example (after the more tag): I had a bunch of pages on a website, and I wanted to copy the book metadata from all of them into XML so I could generate a catalog from the XML. The catch is that the pages were all hand coded. They did a pretty good job of using CSS to identify the relevant parts, but there were still inconsistencies, and a few of the older pages were so out of whack that they didn't get processed at all.

      If you look at the code, it's pretty specific to the pages I was scraping, so it's ugly in all sorts of ways. It could also be made somewhat simpler if I needed to do it a bunch more times--it's a bit repetitive in pulling out a bunch of the labeled items, so those could be a loop through an array of names, with perhaps some flags in the array for special treatment. There are also extraneous modules called--the original pages were inconsistent about odd characters and entities, and that was one of the bigger headaches. Note how I find the pieces I want--I know how they're named, so I just do a look_down to find them and then process the contents from there. Note also that I use XML::Writer to generate the XML, rather than trying to do it myself.
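
      Not the original code, just a stripped-down sketch of that shape--the file name and the class/field names here are invented:

          use strict;
          use warnings;
          use HTML::TreeBuilder;
          use XML::Writer;
          use IO::File;

          my $tree = HTML::TreeBuilder->new_from_file('book-page.html');   # invented name

          my $out    = IO::File->new('> catalog.xml');
          my $writer = XML::Writer->new( OUTPUT => $out, DATA_MODE => 1, DATA_INDENT => 2 );

          $writer->startTag('book');
          # the pages label each field with a CSS class, so look_down by class
          for my $field (qw(title author isbn)) {        # invented field names
              my $el = $tree->look_down( class => $field );
              $writer->dataElement( $field, $el ? $el->as_trimmed_text : '' );
          }
          $writer->endTag('book');
          $writer->end;
          $out->close;
          $tree->delete;

      The real thing loops over many pages and deals with the entity mess, but the look_down-then-XML::Writer shape is the same.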

Re^2: extracting data from HTML
by Jurassic Monk (Acolyte) on Jun 03, 2012 at 12:57 UTC

    nice XML...

    I probably should have said "well-formed", without even bothering with "valid" XML, for most websites don't produce XHTML, which makes it troublesome to just read in the source and get back an XML object; hence my question.
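
    What I'm really after is something of this shape (a sketch, assuming XML::LibXML's recovering HTML parser fits the bill):

        use strict;
        use warnings;
        use XML::LibXML;

        my $html = do { local $/; <> };   # slurp the non-XHTML source

        # the recovering HTML parser coaxes tag soup into a well-formed DOM
        my $dom = XML::LibXML->load_html(
            string  => $html,
            recover => 2,                 # 2 = repair silently, 1 = warn while repairing
        );

        # plain XPath now works, however sloppy the original markup was
        for my $href ( $dom->findnodes('//a/@href') ) {
            print $href->value, "\n";
        }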