Queries on HTML::TableExtract - How to parse from saved html file

howdoesitwork has asked for the wisdom of the Perl Monks concerning the following question:

Hello there, I'm pretty new to Perl and mainly have a Java background, so do let me know if I'm missing something completely obvious.

I have been looking at various examples, but so far I haven't been able to get any that use parse_file to work. I only managed to get one working, and it used parse() to parse an HTML string.

For more background, I'm on 64-bit Windows using Strawberry Perl, and I installed all the prerequisites for HTML::TableExtract through CPAN. If possible, what I'd love is an example of extracting table data from an HTML file already saved locally; I should hopefully be able to fumble my way around from there.

Essentially, what I need to do is extract some table rows from an HTML file that's saved on my computer. My apologies for the pretty horribly formatted post, and thanks for having a look!

edit: can't seem to post in the thread, probably doing something wrong.

aitap: This is part of the file I'll be parsing (it's pretty horribly formatted, and there are empty td tags sometimes.)

            <tr>                            <td>2012/07/30</td>
                            <td><a
href="http://www.zone-h.org/archive/special=1/notifier=Dg4nx">Dg4nx</a></td>
                            <td>H</td>
                            <td></td>
                            <td><a
href="http://www.zone-h.org/archive/domain=www.bauan.gov.ph">R</a></td>
                                                        <td><img src="../../images/cflags/png/us.png" alt="United States" title="United States"></td>
                            <td><img src="../../images/star.gif" border="0"></td>
                            <td>www.bauan.gov.ph
                            </td>
                            <td>Linux</td>
                            <td><a
href="http://www.zone-h.org/mirror/id/18160940">mirror</a></td>
                        </tr>

As for examples, the one I'm trying is from http://search.cpan.org/~msisk/HTML-TableExtract-2.10/lib/HTML/TableExtract.pm but I seem to be missing something. I keep seeing a 'can't call method "tree" on an undefined value at line 5' error when using this code from the TableExtract examples (I have tried passing in an HTML file, $html_file = "page1.html";, but it doesn't seem to be working):

 use HTML::TableExtract qw(tree);
 $te = HTML::TableExtract->new( headers => [qw(Date Notifier H M R L Domain OS View)] );
 $te->parse_file($html_file);
 $table = $te->first_table_found;
 $table_tree = $table->tree;
 $table_html = $table_tree->as_HTML;
 $table_text = $table_tree->as_text;
 $document_tree = $te->tree;
 $document_html = $document_tree->as_HTML;

(My input likely won't fit this, but I'm just trying to get an example working to start with. I know I'm missing something, but I'm not quite sure what.)

influx: I'll give that a shot, thanks. appreciate the responses!


Replies are listed 'Best First'.
Re: Queries on HTML::TableExtract - How to parse from saved html file by influx (Beadle) on Aug 08, 2012 at 08:37 UTC
I don't know much about that module, but if parse_file() isn't cooperating, you could slurp the file into a string and keep using the parse() method instead. For larger files you might be better off using File::Slurp or something similar:

 use File::Slurp 'read_file';
 my $html = "/path/to/file.html";
 my $str = read_file($html);

Once you've done that, you can parse $str just as you would any other string.
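If installing File::Slurp is not convenient, the same slurp step can be sketched with core Perl only. The helper name slurp_file and the temporary sample file are made up for illustration; in practice the path would point at the saved page.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# slurp_file() is a hypothetical helper, not part of any module.
sub slurp_file {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    local $/;              # no record separator: read the whole file at once
    my $content = <$in>;
    close $in;
    return $content;
}

# Stand-in for the saved HTML page, so the sketch runs on its own.
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} "<table><tr><td>Linux</td></tr></table>\n";
close $fh or die "Cannot close $path: $!";

my $str = slurp_file($path);
print length($str), " characters read from $path\n";
```

The resulting $str can then be handed to parse(). Since HTML::TableExtract builds on HTML::Parser, calling eof after parse() should signal the end of the document.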
Re: Queries on HTML::TableExtract - How to parse from saved html file by Anonymous Monk on Aug 08, 2012 at 10:10 UTC
'can't call method "tree" on an undefined value at line 5' means that first_table_found did not find any tables, so $table is undefined. It can happen.
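One way to narrow this down is to drop the headers constraint, list every table the module finds, and check the result before calling any method on it. As far as I understand the documentation, a table only matches when its header row contains all the requested header strings, so an overly long headers list is a likely culprit. This is a rough sketch: the sample HTML and the temporary file are invented stand-ins for the saved page.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TableExtract;

# Invented sample data standing in for the saved page.
my ($fh, $html_file) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} <<'HTML';
<table>
  <tr><th>Date</th><th>Notifier</th><th>Domain</th><th>OS</th></tr>
  <tr><td>2012/07/30</td><td><a href="#">Dg4nx</a></td><td>www.bauan.gov.ph</td><td>Linux</td></tr>
</table>
HTML
close $fh or die "Cannot close $html_file: $!";

# No headers given, so every table in the file is kept.
my $te = HTML::TableExtract->new;
$te->parse_file($html_file);

# Guard against "no tables found" before touching any table object.
my @tables = $te->tables;
die "No <table> elements found in $html_file\n" unless @tables;

my @rows;
for my $table (@tables) {
    for my $row ($table->rows) {
        # Empty cells can come back undefined, so map them to ''.
        push @rows, [ map { defined $_ ? $_ : '' } @$row ];
    }
}
print join(' | ', @$_), "\n" for @rows;
```

Once rows show up this way, the headers option can be added back one header at a time to see which string fails to match.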
Re^2: Queries on HTML::TableExtract - How to parse from saved html file by howdoesitwork (Initiate) on Aug 13, 2012 at 01:23 UTC
Hmm, gotcha =/ I'll keep trying, then. Thanks.
Re^3: Queries on HTML::TableExtract - How to parse from saved html file by Anonymous Monk on Aug 13, 2012 at 01:26 UTC
There are such things as "CSS div tables", which use div elements and CSS to look like tables in modern browsers but aren't real tables. HTML::TableExtract won't help you with those.
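A quick sanity check along these lines is to count the literal table tags in the markup before blaming the module. The helper name count_table_tags and both sample strings are invented for illustration, and a regex is only a rough test here, not a real HTML parse.

```perl
use strict;
use warnings;

# count_table_tags() is a hypothetical helper, not part of HTML::TableExtract.
sub count_table_tags {
    my ($html) = @_;
    # Count opening <table ...> tags, ignoring case.
    my $count = () = $html =~ /<table\b/gi;
    return $count;
}

my $real_table = '<table><tr><td>Linux</td></tr></table>';
my $div_table  = '<div class="table"><div class="row">Linux</div></div>';

printf "real table: %d, div-based layout: %d\n",
    count_table_tags($real_table), count_table_tags($div_table);
```

If the count for the saved page is zero, the page is probably using a div-based layout and a different extraction approach would be needed.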
Re^4: Queries on HTML::TableExtract - How to parse from saved html file by howdoesitwork (Initiate) on Aug 13, 2012 at 01:54 UTC
Re: Queries on HTML::TableExtract - How to parse from saved html file by aitap (Curate) on Aug 08, 2012 at 08:31 UTC
Can you post a small example of your file, so it will be easier to help you parse it? Posting examples usually helps others to help you, whether it's an example of code, a file to be parsed, or an error message. Sorry if my advice was wrong.