Re: Sucking Data off a Web Page

by Corion (Patriarch)
on Oct 10, 2004 at 08:55 UTC


in reply to Sucking Data off a Web Page

mojotoad wrote a really great scraping module, HTML::TableExtract, which easily scrapes an HTML table into an array of arrays. You can then convert that to a CSV file again, or stuff it into a database via DBI directly. For example, the following code tries to extract all rows from "the one table" on the page:

    use HTML::TableExtract;

    # $html holds the source of the page you fetched
    my $te = HTML::TableExtract->new();
    $te->parse($html);
    foreach my $row ($te->rows) {
        print join(',', @$row), "\n";
    }

The only problem with your table is that it is organized in rows rather than in columns, so you will have to flip (transpose) the table.
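Flipping the table could look roughly like the sketch below. It is plain Perl and does not touch HTML::TableExtract at all; the helper name transpose() and the sample data are made up for illustration, and I am assuming that short rows should be padded with empty strings.

```perl
use strict;
use warnings;

# Flip an array of arrays so that rows become columns.
# transpose() is a hypothetical helper name, not part of any module.
# Missing or undefined cells are padded with empty strings (an assumption).
sub transpose {
    my @rows = @_;

    # Find the widest row so that ragged input still yields a rectangle.
    my $width = 0;
    for my $row (@rows) {
        $width = @$row if @$row > $width;
    }

    my @flipped;
    for my $r (0 .. $#rows) {
        for my $c (0 .. $width - 1) {
            my $cell = $rows[$r][$c];
            $flipped[$c][$r] = defined $cell ? $cell : '';
        }
    }
    return @flipped;
}

# Sample data standing in for the rows scraped from the page.
my @rows = (
    [ 'Name', 'Alice', 'Bob' ],
    [ 'Age',  '30',    '25'  ],
);

print join(',', @$_), "\n" for transpose(@rows);
```

With the sample data this prints the lines Name,Age then Alice,30 then Bob,25. In the real script you would pass the scraped rows to transpose() instead of the sample list.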

Update: I realized that it was mojotoad, not jeffa who wrote HTML::TableExtract.

Re^2: Sucking Data off a Web Page
by muba (Priest) on Oct 10, 2004 at 19:37 UTC
    I'm just wondering whether that module handles colspan and rowspan cells well... I'm not saying it's a bad module if it cannot. Rather, I will think it is pretty good if it can.

      Just to resolve doubts, HTML::TableExtract does handle colspan/rowspan correctly. Quoting the POD:

      Furthermore, TableExtract will automatically compensate for cell span issues so that columns are really the same columns as you would visually see in a browser.