Kyshtynbai has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone!

I'm having a problem with the HTML::TableExtract module. I've been fighting it all night long. If you guys could help, I'd be very grateful!

Let's say I have a table in HTML; please take a look at its code:

<html>
<head>
  <title> Some animals and letters: </title>
</head>
<body>
<table border = "1">
  <caption> <h4>table</h4> </caption>
  <thead>
    <tr>
      <th></th>
      <th colspan="3">1st header</th>
      <th colspan="3">2nd header</th>
      <th colspan="3">3rd header</th>
    </tr>
    <tr>
      <th></th>
      <th colspan="3">subhead1</th>
      <th colspan="3">subhead2</th>
      <th colspan="3">subhead3</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td></td>
      <td>text</td>
      <td>more text</td>
      <td>some more text</td>
      <td>dog</td>
      <td>bear</td>
      <td>cat</td>
      <td>toocan</td>
      <td>inu</td>
      <td>pes</td>
    </tr>
  </tbody>
</table>
</body>
</html>

I want to extract (or pass somewhere, or change somehow) the "subhead1" and "subhead2" columns, including the third row! Here is the Perl code for it:

#!/usr/bin/perl
use HTML::TableExtract;
use Text::Table;
use Data::Dumper;
use strict;

my $content = 'table.html';
my $headers = ['subhead1', 'subhead2'];
my $tbl_extr = HTML::TableExtract->new(headers => $headers);
my $tbl_out = Text::Table->new(@$headers);
$tbl_extr->parse_file($content);
my ($table) = $tbl_extr->tables;
my $row;
foreach $row ($table->rows) {
    $tbl_out->load($row);
}
print $tbl_out;

But what I get is:

~/www$ ./tblext.pl
subhead1 subhead2
text     dog

And I need to get all of the entries in the third row! Could anyone please point to a mistake in the code?

Thank you in advance.

Re: HTML::TableExtractor and embedded columns
by vinoth.ree (Monsignor) on Mar 17, 2014 at 08:03 UTC

    How about using table tag attributes?

    use Data::Dumper;
    use HTML::TableExtract;
    use Text::Table;

    my $content = './table.html';
    my $headers = ['subhead1', 'subhead2'];
    #my $te = HTML::TableExtract->new(headers => $headers);
    my $tbl_extr = HTML::TableExtract->new(attribs => { border => 1 });
    my $tbl_out = Text::Table->new(@$headers);
    $tbl_extr->parse_file($content);
    foreach my $ts ($tbl_extr->tables) {
        print "Table with border=1 found at ", join(',', $ts->coords), ":\n";
        foreach my $row ($ts->rows) {
            print " ", join(',', @$row), "\n";
        }
    }

    All is well
      I'll check that out, thanks. But using attribs doesn't solve the problem. Well, it may solve it in this particular case, but what if no attributes are specified in table tags? That's what I'm thinking about.
Re: HTML::TableExtractor and embedded columns
by Anonymous Monk on Mar 17, 2014 at 08:03 UTC

    Could anyone please point to a mistake in the code?

    Ask yourself these questions (and answer them):
    Is the problem what tableextract gives you ($row)? What is $row?
    Or is the problem what you give to load ($row)? What does load do?

    If you divorce the two (tableextract and text::table) and employ ddumper and the Basic debugging checklist, you can figure out the problem.

    As I see it, either tableextract doesn't give you everything, or load() isn't the same as add()

    I suspect load() isn't the same as add(), because load means replace, and add means extend :)

      I've already used Data::Dumper to find out where it breaks. And it breaks before the load method is called. So it seems that tableextract doesn't parse the rest of the td's :(. I can't figure out why, however. Gotta think more.
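
The output in the question ("text" and "dog") is consistent with the headers filter keeping only one grid column per matched header cell, so a header with colspan="3" contributes a single value rather than three. This reading is an assumption based on that output, not a confirmed description of the module's internals. Here is a minimal sketch in core Perl of the alternative: expand each header's span into column indices, then slice the full data row. The helper name span_columns is hypothetical, and the header and data rows are hard-coded copies of the table in the question. With HTML::TableExtract itself, the full rows would have to come from an extractor created without headers (for example one selected by depth and count).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableExtract): map each header
# label to the grid columns it spans. $cells is a reference to a list of
# [label, colspan] pairs describing one header row, left to right.
sub span_columns {
    my ($cells) = @_;
    my %cols_for;
    my $next = 0;    # index of the next free grid column
    for my $cell (@$cells) {
        my ($label, $span) = @$cell;
        $cols_for{$label} = [ $next .. $next + $span - 1 ];
        $next += $span;
    }
    return \%cols_for;
}

# Second header row and the data row, copied from the table in the question.
my @subheads = ( ['', 1], ['subhead1', 3], ['subhead2', 3], ['subhead3', 3] );
my @data_row = ( '', 'text', 'more text', 'some more text',
                 'dog', 'bear', 'cat', 'toocan', 'inu', 'pes' );

my $cols_for = span_columns(\@subheads);

# Collect every column under the wanted headers, not just the first one.
my @wanted  = map { @{ $cols_for->{$_} } } qw(subhead1 subhead2);
my @entries = @data_row[@wanted];

# prints: text, more text, some more text, dog, bear, cat
print join(', ', @entries), "\n";
```

The slice gives all six cells under "subhead1" and "subhead2". Each triple could then be joined into one string per header before it is loaded into a two-column Text::Table.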