in reply to Sucking Data off a Web Page

mojotoad wrote a really great scraping module, HTML::TableExtract, which easily scrapes an HTML table into an array of arrays. You can then convert that into a CSV file, or stuff it into DBI directly. For example, the following code extracts all rows from "the one table" on the page:

use HTML::TableExtract;

my $te = HTML::TableExtract->new();
$te->parse($html);
foreach my $row ($te->rows) {
    print join(',', @$row), "\n";
}

The only problem is that your table is organized in rows rather than columns, so you will have to flip (transpose) the extracted data.
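
Flipping an array of arrays takes only a couple of nested loops. A minimal, untested sketch, assuming @rows holds what $te->rows returned above:

my @rows = $te->rows;
my @flipped;
for my $r (0 .. $#rows) {
    for my $c (0 .. $#{ $rows[$r] }) {
        # row $r, column $c of the original becomes row $c, column $r
        $flipped[$c][$r] = $rows[$r][$c];
    }
}
print join(',', @$_), "\n" for @flipped;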

Update: I realized that it was mojotoad, not jeffa who wrote HTML::TableExtract.

Re^2: Sucking Data off a Web Page
by muba (Priest) on Oct 10, 2004 at 19:37 UTC
    I'm just wondering whether that module handles colspan and rowspan cells well... I'm not saying I'd think it's a bad module if it can't; rather, I'd think it's pretty good if it can.

    "2b"||!"2b";$$_="the question"
    Besides that, my code is untested unless stated otherwise.
    One more: please review the article about regular expressions (do's and don'ts) I'm working on.

      Just to resolve doubts, HTML::TableExtract does handle colspan/rowspan correctly. Quoting the POD:

      Furthermore, TableExtract will automatically compensate for cell span issues so that columns are really the same columns as you would visually see in a browser.
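
      If you want to convince yourself, here is a quick, untested sketch that feeds it a tiny made-up table with a colspan'd header and dumps what comes back; spanned positions may come back undef, so the example maps those to empty strings:

      use HTML::TableExtract;

      # Illustrative HTML only -- the names and numbers are made up.
      my $html = '<table>'
               . '<tr><th>Name</th><th colspan="2">Phone</th></tr>'
               . '<tr><td>Alice</td><td>555-1111</td><td>555-2222</td></tr>'
               . '</table>';

      my $te = HTML::TableExtract->new();
      $te->parse($html);
      for my $row ($te->rows) {
          # guard against undef cells from the spanned positions
          print join(',', map { defined $_ ? $_ : '' } @$row), "\n";
      }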