Okay, that seems to work; HTML::TreeBuilder seems to be more forgiving.

However, $tree->dump gives a lot of information; luckily, as_XML_indented looks more readable again.

Now the next part... extracting the right pieces of information with XPath

Some pieces will be quite easy, for example the title. Others will come from traversing a <TABLE>:
in the left column there is a data description, like 'Author', and in the right column the name, like 'Wall, L.' (sometimes wrapped in <a HREF=...>Author Name</a>, which makes it a bit more complicated, since I only want the text).

My guess is to look for a text element in a <td> tag that equals "Author", and then do something with its next sibling?
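
A minimal sketch of that idea, assuming HTML::TreeBuilder::XPath is installed. The sample markup, the href, and the helper name field_value are made up for illustration; the real page will differ. The XPath matches the <td> whose text is the label and takes the first following sibling <td>. Because the string value of an element is the text of all its descendants, a name wrapped in <a> comes back as plain text.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup standing in for the real page.
my $html = <<'HTML';
<html><head><title>Programming Perl</title></head>
<body>
<table>
  <tr><td>Author</td><td><a href="/authors/wall">Wall, L.</a></td></tr>
  <tr><td>Publisher</td><td>O'Reilly</td></tr>
</table>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# field_value is a hypothetical helper: given a label such as 'Author',
# return the text of the cell to its right.
# Note: $label is interpolated into the XPath, so it must not contain
# a double quote.
sub field_value {
    my ($tree, $label) = @_;
    my $value = $tree->findvalue(
        qq{//td[normalize-space(.)="$label"]/following-sibling::td[1]}
    );
    $value =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return $value;
}

print "Title:  ", $tree->findvalue('//title'), "\n";
print "Author: ", field_value($tree, 'Author'), "\n";
```

If the label cell contains extra markup or a trailing colon, the predicate would need to be loosened, for example with contains() instead of an exact comparison.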

Re^3: extracting data from HTML
by Corion (Patriarch) on Jun 03, 2012 at 16:41 UTC
    All your scraping will always be specific to the page(s) you're scraping. Personally, I like to use CSS selectors, as they give results more quickly than fighting with XPath. Whenever CSS is not enough, I fall back to looking at the XPath expressions Firebug suggests for the elements, and work from those.
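
    One possible way to try CSS selectors against the same kind of tree is sketched below. It assumes the HTML::Selector::XPath module, which translates a selector into an XPath expression that HTML::TreeBuilder::XPath can run; the reply does not say which module was meant, and the sample markup and the selector are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

# Hypothetical sample markup standing in for the real page.
my $html = <<'HTML';
<html><body>
<table>
  <tr><td>Author</td><td><a href="/authors/wall">Wall, L.</a></td></tr>
  <tr><td>Publisher</td><td>O'Reilly</td></tr>
</table>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Translate the CSS selector to XPath, then query the tree with it.
my $xpath = selector_to_xpath('table td a');
my @links = $tree->findnodes($xpath);

print "XPath used: $xpath\n";
print $_->as_text, "\n" for @links;
```

    A selector such as 'table td a' cannot say "the cell next to the one containing Author", so for label-based lookups a hand-written XPath may still be needed.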