in reply to Table.pm: Extract text from html tables

Hmm. Were you aware of both HTML::Table and HTML::TableExtract in the CPAN? If not, learn to search for modules, and save yourself some development time!

-- Randal L. Schwartz, Perl hacker

Re: Re: Table.pm: Extract text from html tables
by zzspectrez (Hermit) on Jan 19, 2001 at 05:58 UTC

    I did search search.cpan.org first. I agree with you that it is better not to reinvent the wheel if you don't have to: not only are you wasting time, but the established code will probably be more efficient, or at least better debugged.

    However, I don't think HTML::Table applies well in this situation, because it is for creating tables. I just want to get the data.

    I did install HTML::TableExtract before attempting it myself. However, it did not seem to work well for my needs. The author states that it was designed with the aim of selecting table data based on table headers. In my case, the site I am accessing doesn't use text headers in its tables at all. The module also allows selecting data by Depth and Count.

    From the POD:

    Depth and Count are more specific ways to specify tables in relation to one another. Depth represents how deeply a table resides in other tables. The depth of a top-level table in the document is 0. A table within a top-level table has a depth of 1, and so on. Each depth can be thought of as a layer; tables sharing the same depth are on the same layer. Within each of these layers, Count represents the order in which a table was seen at that depth, starting with 0. Providing both a depth and a count will uniquely specify a table within a document.

    This seems confusing to me when the document, like the one I am accessing, has multiple top-level tables with many subtables beneath them.
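    To make the Depth and Count numbering concrete, here is a minimal sketch (not from the original post) using a made-up document with two top-level tables, each holding one nested table. It assumes a reasonably recent HTML::TableExtract release that provides the tables, rows, and coords methods; the sample HTML and the @found variable are hypothetical.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample: two top-level tables, each with one nested table.
my $html = join '',
    '<table><tr><td>A<table><tr><td>a1</td></tr></table></td></tr></table>',
    '<table><tr><td>B<table><tr><td>b1</td><td>b2</td></tr></table></td></tr></table>';

# Depth 1 selects tables nested one level down. Count is taken across the
# whole depth layer, not per parent, so count 1 is the nested table that
# sits inside the SECOND top-level table.
my $te = HTML::TableExtract->new( depth => 1, count => 1 );
$te->parse($html);

my @found;    # rows collected from every matching table
for my $ts ( $te->tables ) {
    print "table at depth,count = ", join( ',', $ts->coords ), "\n";
    for my $row ( $ts->rows ) {
        # Cells can be undef for empty positions; normalise for printing.
        push @found, [ map { defined $_ ? $_ : '' } @$row ];
        print join( ' | ', @{ $found[-1] } ), "\n";
    }
}
```

    Because the count runs across every table at a given depth, finding the right pair for a page with many sibling tables means tallying nested tables under all the earlier parents as well.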

    My solution lets me access the table data as a multidimensional array. Just count each <table> tag until you reach the table that contains the data you want, note the row and column within that table, and then access it as $table->[table_number][row][column]. This seems much easier and, in my opinion, a better tool for my particular situation. Of course, HTML::TableExtract is a much more robust way to handle tables, and better for situations where you can select the tables using headers instead of hard-coding to the page layout.
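    The indexing scheme described above could be sketched roughly as follows. This is an illustration built directly on HTML::Parser, not the actual Table.pm code; the extract_tables helper and the sample HTML are hypothetical, and zero-based numbering of tables, rows, and columns is an assumption.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns $table->[table_number][row][column], where
# table_number counts <table> tags in document order (nested ones included).
sub extract_tables {
    my ($html) = @_;
    my @tables;    # every table, in the order its <table> tag was seen
    my @open;      # stack of currently open tables: { index, in_cell }

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub {
            my ($tag) = @_;
            if ( $tag eq 'table' ) {
                push @tables, [];
                push @open, { index => $#tables, in_cell => 0 };
            }
            elsif ( @open && $tag eq 'tr' ) {
                push @{ $tables[ $open[-1]{index} ] }, [];
                $open[-1]{in_cell} = 0;
            }
            elsif ( @open && ( $tag eq 'td' || $tag eq 'th' ) ) {
                my $t = $tables[ $open[-1]{index} ];
                push @$t, [] unless @$t;    # tolerate a missing <tr>
                push @{ $t->[-1] }, '';
                $open[-1]{in_cell} = 1;
            }
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            return unless @open;
            if    ( $tag eq 'table' ) { pop @open }
            elsif ( $tag eq 'td' || $tag eq 'th' ) { $open[-1]{in_cell} = 0 }
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            return unless @open && $open[-1]{in_cell};
            # Append decoded text to the current cell of the innermost table.
            $tables[ $open[-1]{index} ][-1][-1] .= $text;
        }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;
    return \@tables;
}

# Hypothetical sample: table 0 is top-level, table 1 is top-level,
# table 2 is nested inside table 1.
my $html = join '',
    '<table><tr><td>a</td><td>b</td></tr></table>',
    '<table><tr><td>outer',
    '<table><tr><td>c</td></tr><tr><td>d</td><td>e</td></tr></table>',
    '</td></tr></table>';

my $table = extract_tables($html);
print $table->[2][1][1], "\n";    # third <table> tag, second row, second column: prints "e"
```

    The trade-off is the one already noted: the indices are tied to the page layout, so any added or removed table on the page shifts every table_number after it.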

    If you disagree with this, I would be interested in your reasons. I respect your opinion, as a known Perl wizard!

    Thanks!
    zzSPECTREz

Re: Re: Table.pm: Extract text from html tables
by extremely (Priest) on Jan 19, 2001 at 05:38 UTC
    I dunno, merlyn. If you were already committed to using HTML::Parser, this might be nice to have about. Once you are carrying the whole toolbox in, it seems a shame to go back for one more tool.

    --
    $you = new YOU;
    honk() if $you->love(perl)