mecrazycoder has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks, I am writing a web crawler that extracts HTML tables from different sites. For this I am using the HTML::TableExtract module. The problem I am facing is that some of the columns of those tables keep changing, while others are static. So is it possible to retrieve entire table by using only static columns names. The HTML::TableExtract documentation says it retrieves data only from the matched columns. Please let me know whether there is a way to retrieve the entire table using only the static column names as headers, or suggest another way of doing it. Thanks in advance.

Replies are listed 'Best First'.
Re: Retrieving tables using Html:TableExtract
by ww (Archbishop) on Feb 12, 2012 at 14:57 UTC

    I'm not sure I understand your statement of circumstances. First, what do you mean by "dynamic" columns? "Columns whose quantity and/or headers may vary from one time to another" is what occurs to me, except that that makes little sense, so I'm probably not clear about your intent.

    And you go on to say, "So is it possible to retrieve entire table by using only static columns names. " which I take to be a question, despite the lack of a question mark. If so, the answer is "No, not with HTML::TableExtract. For that, review other modules such as HTML::Parser or members of the WWW::... group (esp. WWW::Mechanize)."

    The docs for HTML::TableExtract say quite specifically

    ... tables can be matched using column headers, depth, count within a depth, table tag attributes, or some combination of the four.

    and, again, in the DESCRIPTION section,

    There are currently four constraints available to specify which tables you would like to extract from a document: *Headers*, *Depth*, *Count*, and *Attributes*.
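
    For what it's worth, the same documentation also lists a slice_columns option, which may cover the "static headers only" case after all. The following is a minimal sketch, not a tested recipe: the sample HTML and the header names ("Name", "Price", "Views today") are invented for illustration, and the behaviour should be checked against the installed version of the module.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Invented sample page: "Name" and "Price" stand in for the static headers,
# "Views today" for a column whose header changes between visits.
my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Views today</th><th>Price</th></tr>
  <tr><td>Widget</td><td>17</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>4</td><td>19.50</td></tr>
</table>
HTML

# Match the table on the static headers only. slice_columns => 0 asks the
# module to keep every column of a matched table, not just the columns that
# sit under a matched header.
my $te = HTML::TableExtract->new(
    headers       => [ 'Name', 'Price' ],
    slice_columns => 0,
);
$te->parse($html);
$te->eof;

# Collect the rows of every table that matched the header constraint.
my @rows;
for my $table ( $te->tables ) {
    push @rows, $table->rows;
}

for my $row (@rows) {
    print join( ' | ', map { defined $_ ? $_ : '' } @$row ), "\n";
}
```

    Adding keep_headers => 1 to the constructor should retain the header row as well, which helps when the changing header names are themselves of interest.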

    Skipping back toward the top of the doc, does the second example relate to your question?

    The third example, using tags to ID by attributes, seems unlikely to fit your problem description, and once again, it relies on the programmer knowing the header names desired... and on the web-monkeys having used a header row of <th>...</th> labels... something I wouldn't want to guarantee (as a P/T web-monkey, myself), as some are [lazy ignorant limited-to-inadequate-tools] and because some tables' content is too obvious in intent to justify the added header code.

    And, if none of this addresses your concerns, please set me straight by clarifying the question.

      Thank you. BTW, the column headers will be changing.
        Well, you can always pre-read the header with an HTML parser -- too many ways to even discuss -- and then pick, either manually, ad hoc... or by setting up a table of headers that *MIGHT* occur that you think you'll want... and then entering your scraping routine.

        Yes, I realize choice 1 doesn't automate... and 2 is fraught with possibilities for missing stuff you want.
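
        A rough sketch of choice 2, under stated assumptions: HTML::TableExtract with no constraints is used here as the "HTML parser" for the pre-read (any parser would do), the first row of the first table is assumed to be the header row, and the sample HTML and the %might_want list are hypothetical.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Invented sample page; "Views today" plays the header that keeps changing.
my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Views today</th><th>Price</th></tr>
  <tr><td>Widget</td><td>17</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>4</td><td>19.50</td></tr>
</table>
HTML

# Hypothetical table of headers that *might* occur and that we would want.
my %might_want = map { $_ => 1 } ( 'Name', 'Price', 'Stock', 'Rating' );

# Pre-read: with no constraints every table is captured, header row included.
my $te = HTML::TableExtract->new;
$te->parse($html);
$te->eof;

my ($table) = $te->tables;
die "no table found\n" unless $table;
my ( $header, @data ) = $table->rows;

# Trim the header text and note which column positions are wanted this time.
my @labels = map {
    my $h = defined $_ ? $_ : '';
    $h =~ s/^\s+|\s+$//g;
    $h;
} @$header;
my @keep = grep { $might_want{ $labels[$_] } } 0 .. $#labels;

# Scrape: keep only the wanted columns, keyed by header name.
my @records;
for my $row (@data) {
    my %rec;
    @rec{ @labels[@keep] } = @{$row}[@keep];
    push @records, \%rec;
}

for my $rec (@records) {
    print join( ', ',
        map { "$_=" . ( defined $rec->{$_} ? $rec->{$_} : '' ) } sort keys %$rec ),
        "\n";
}
```

        Anything not listed in %might_want is silently dropped, which is exactly the "missing stuff you want" risk mentioned above.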