imielins has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I'm using HTML::TableExtract to grab data out of specific cells in an HTML table. However, HTML::TableExtract only seems to return the raw text of each cell, so I have no way to parse the HTML inside a cell (e.g. list items).

Supposedly one can set a "decode" parameter to false when instantiating the TableExtract object, but this does not change a thing. The module still returns (strangely concatenated) plain text from inside the cells.

So I went into the code of the TableExtract module and found the reason: the parameter controls a switch in the "text" event handler that either decodes the text it receives or leaves it untouched before it bubbles up to the output. However, while debugging the code, I found that the text event handler already receives plain, HTML-stripped text, so this switch makes no difference to whether you receive plain text or HTML in the end.

My understanding of the HTML::Parser module (which TableExtract subclasses) is not sufficient to work out where I could pick up those tags before they get stripped from the text. I suppose it will require some modification of the TableExtract module. Has anybody examined the innards of this module, or does anybody understand HTML::Parser well enough to give me a clue on what to try next?

Thanks -
M_ski

Replies are listed 'Best First'.
Re: Parsing Cell Contents of Extracted HTML Tables
by epoptai (Curate) on Jun 29, 2001 at 23:43 UTC
    I can't solve your problem, but I can tell you that the 'decode' parameter only toggles the use of HTML::Entities. Look into the 'br_translate' parameter, which translates <br> to \n, to eliminate the strange concatenation.

    Perhaps you could use the information extracted from the table to reparse the file for links and such.
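
One way to approach the reparsing idea is to go to HTML::Parser directly and collect the raw markup of each cell yourself. What follows is only a rough sketch, not a patch to HTML::TableExtract: it assumes the tables are not nested, the helper name extract_cell_html is made up for illustration, and it relies on the version 3 handler API of HTML::Parser, where the 'text' argspec hands each handler the original, undecoded source text.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns an array ref of rows, each row an array ref
# holding the raw HTML found inside every <td>/<th> cell.
# Assumption: tables are not nested inside cells.
sub extract_cell_html {
    my ($html) = @_;
    my (@rows, $row, $cell);    # $cell is defined only while inside a cell

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $text) = @_;
            if    ($tag eq 'tr')                 { $row  = [] }
            elsif ($tag eq 'td' or $tag eq 'th') { $cell = '' }
            elsif (defined $cell)                { $cell .= $text }  # keep inner tags
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            if ($tag eq 'td' or $tag eq 'th') {
                push @$row, $cell if $row and defined $cell;
                undef $cell;
            }
            elsif ($tag eq 'tr') {
                push @rows, $row if $row;
                undef $row;
            }
            elsif (defined $cell) { $cell .= $text }                 # keep inner end tags
        }, 'tagname, text' ],
        # 'text' (not 'dtext') passes the source text through without entity decoding
        text_h => [ sub { $cell .= $_[0] if defined $cell }, 'text' ],
    );

    $p->parse($html);
    $p->eof;
    return \@rows;
}

my $rows = extract_cell_html(
    '<table><tr><td><ul><li>a</li><li>b</li></ul></td><td>x &amp; y</td></tr></table>'
);
print join(' | ', @$_), "\n" for @$rows;
```

Each cell string can then be fed to whatever you use to pick out list items or links. If the real pages do nest tables, the handlers would need a depth counter, which this sketch leaves out.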

    --
    Check out my Perlmonks Related Scripts like framechat, reputer, and xNN.