in reply to Sucking Data off a Web Page

mojotoad wrote a really great scraping module, HTML::TableExtract, which easily scrapes an HTML table into an array of arrays. You can then convert that into a CSV file, or stuff it into DBI directly. For example, the following code extracts all rows from "the one table" on the page:

use HTML::TableExtract;

my $te = HTML::TableExtract->new();
$te->parse($html);
foreach my $row ($te->rows) {
    print join(',', @$row), "\n";
}

The only problem is that your table is organized in rows rather than columns, so you will have to flip (transpose) the extracted data.
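
Flipping an array of arrays takes only a couple of nested loops. A minimal, untested sketch, assuming @rows holds what $te->rows returned above:

my @rows = $te->rows;
my @flipped;
for my $r (0 .. $#rows) {
    for my $c (0 .. $#{ $rows[$r] }) {
        # row $r, column $c of the original becomes row $c, column $r
        $flipped[$c][$r] = $rows[$r][$c];
    }
}
print join(',', @$_), "\n" for @flipped;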

Update: I realized that it was mojotoad, not jeffa who wrote HTML::TableExtract.

Re^2: Sucking Data off a Web Page
by muba (Priest) on Oct 10, 2004 at 19:37 UTC
    I'm just wondering whether that module handles colspan and rowspan cells well... I'm not saying I'd think it's a bad module if it can't; rather, I'd think it's pretty good if it can.

    "2b"||!"2b";$$_="the question"
    Besides that, my code is untested unless stated otherwise.
    One more: please review the article about regular expressions (do's and don'ts) I'm working on.

      Just to resolve doubts, HTML::TableExtract does handle colspan/rowspan correctly. Quoting the POD:

      Furthermore, TableExtract will automatically compensate for cell span issues so that columns are really the same columns as you would visually see in a browser.
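
      If you want to convince yourself, here is a quick, untested sketch that feeds it a tiny made-up table with a colspan'd header and dumps what comes back; spanned positions may come back undef, so the example maps those to empty strings:

      use HTML::TableExtract;

      # Illustrative HTML only -- the names and numbers are made up.
      my $html = '<table>'
               . '<tr><th>Name</th><th colspan="2">Phone</th></tr>'
               . '<tr><td>Alice</td><td>555-1111</td><td>555-2222</td></tr>'
               . '</table>';

      my $te = HTML::TableExtract->new();
      $te->parse($html);
      for my $row ($te->rows) {
          # guard against undef cells from the spanned positions
          print join(',', map { defined $_ ? $_ : '' } @$row), "\n";
      }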