howdoesitwork has asked for the wisdom of the Perl Monks concerning the following question:

Hello there, I'm pretty new to Perl and mainly have a Java background, so do let me know if I'm missing something completely obvious.

I have been trying to get various examples to work, but so far none of the ones that use parse_file have worked. The only one I got working used parse() to parse an HTML string.

For more background, I'm on 64-bit Windows using Strawberry Perl, and I installed all the prerequisites for HTML::TableExtract through CPAN. If possible, what I'd love is an example of extracting table data from an HTML file already saved locally, and I should hopefully be able to fumble my way around from there.

Essentially, what I need to do is extract some table rows from an HTML file that's saved on my computer. My apologies for the pretty horribly formatted post, and thanks for having a look!

edit: can't seem to post in the thread, probably doing something wrong.

aitap: This is part of the file I'll be parsing (it's pretty horribly formatted, and there are empty td tags sometimes.)

<tr>
<td>2012/07/30</td>
<td><a href="http://www.zone-h.org/archive/special=1/notifier=Dg4nx">Dg4nx</a></td>
<td>H</td>
<td></td>
<td><a href="http://www.zone-h.org/archive/domain=www.bauan.gov.ph">R</a></td>
<td><img src="../../images/cflags/png/us.png" alt="United States" title="United States"></td>
<td><img src="../../images/star.gif" border="0"></td>
<td>www.bauan.gov.ph </td>
<td>Linux</td>
<td><a href="http://www.zone-h.org/mirror/id/18160940">mirror</a></td>
</tr>

As to examples, one I'm trying is from http://search.cpan.org/~msisk/HTML-TableExtract-2.10/lib/HTML/TableExtract.pm but I seem to be missing something. I keep seeing a "can't call method "tree" on an undefined value at line 5" error when using this code from the HTML::TableExtract examples. (I have tried setting $html_file = "page1.html"; to parse an HTML file, but it doesn't seem to be working.)

use HTML::TableExtract qw(tree);
$te = HTML::TableExtract->new( headers => [qw(Date Notifier H M R L Domain OS View)] );
$te->parse_file($html_file);
$table = $te->first_table_found;
$table_tree = $table->tree;
$table_html = $table_tree->as_HTML;
$table_text = $table_tree->as_text;
$document_tree = $te->tree;
$document_html = $document_tree->as_HTML;

(My input likely won't fit this, but I'm just trying to get an example working to start with. I know I'm missing something, but I'm not quite sure what.)

influx: I'll give that a shot, thanks. Appreciate the responses!

Replies are listed 'Best First'.
Re: Queries on HTML::TableExtract - How to parse from saved html file
by influx (Beadle) on Aug 08, 2012 at 08:37 UTC

    I don't know much about that module, but if parse_file() isn't working for you, then perhaps you could slurp the file into a string and continue using the parse() method instead.

    For larger files you might be better off using File::Slurp or something similar:

    use File::Slurp 'read_file';
    my $html = "/path/to/file.html";
    my $str = read_file($html);

    Once you've done that, then you can just parse $str as you normally would a string.
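The slurp-then-parse() approach suggested above could look roughly like the following. This is only a sketch under stated assumptions: the helper names slurp_file and rows_from_html are made up for illustration, the file is read with core Perl rather than File::Slurp, and the extractor is created without a headers constraint so that every table in the input is matched.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: read a whole file into one string using core Perl only.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                 # undefined record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Hypothetical helper: return the rows of every table found in an HTML string.
sub rows_from_html {
    my ($html) = @_;
    my $te = HTML::TableExtract->new;   # no constraints, so every table matches
    $te->parse($html);
    $te->eof;                           # signal end of input to the parser
    my @rows;
    for my $table ($te->tables) {
        push @rows, $table->rows;       # each row is an array reference of cell text
    }
    return @rows;
}

# Usage: perl extract.pl page1.html
if (my $file = shift @ARGV) {
    for my $row (rows_from_html(slurp_file($file))) {
        # Empty <td></td> cells may come back undefined, so substitute ''.
        print join(' | ', map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```

Note that the rows are only found if they sit inside a table element; a bare run of tr elements on its own may not be picked up.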

Re: Queries on HTML::TableExtract - How to parse from saved html file
by Anonymous Monk on Aug 08, 2012 at 10:10 UTC

    can't call method "tree" on an undefined value at line 5

    That means first_table_found did not find any tables, so $table is undefined. It can happen.

      hmm.. gotcha =/ i'll keep trying, then, thanks.
        There are such things as "CSS div tables", which use div elements and CSS to look like tables in modern browsers but aren't real tables. HTML::TableExtract won't help you with those.
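The undefined-value error discussed in this reply can be guarded against by checking what first_table_found returns before calling methods on it. The sketch below is an assumption-laden illustration, not the module's recommended recipe: first_table_rows is a hypothetical helper, and it checks that the file is readable before parsing because a wrong relative path is one possible cause of an empty result. When headers are passed, only tables whose header row contains them are matched, so an empty result may also mean the headers do not appear in the file.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: return the rows of the first matching table,
# or an empty list when no table matched.
sub first_table_rows {
    my ($html_file, @headers) = @_;
    die "Cannot read $html_file\n" unless -r $html_file;

    # With headers, only tables containing those headers are matched;
    # without them, every table in the file is a candidate.
    my $te = @headers
        ? HTML::TableExtract->new(headers => \@headers)
        : HTML::TableExtract->new;
    $te->parse_file($html_file);

    my $table = $te->first_table_found;
    return unless defined $table;       # nothing matched: avoid calling methods on undef
    my @rows = $table->rows;
    return @rows;
}

# Usage: perl first_table.pl page1.html [Header ...]
if (my $file = shift @ARGV) {
    my @rows = first_table_rows($file, @ARGV);
    if (@rows) {
        print join(' | ', map { defined $_ ? $_ : '' } @$_), "\n" for @rows;
    }
    else {
        warn "No matching table found in $file\n";
    }
}
```

Running it first without any header arguments is a quick way to see whether the file contains any table the module recognises at all.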
Re: Queries on HTML::TableExtract - How to parse from saved html file
by aitap (Curate) on Aug 08, 2012 at 08:31 UTC

    Can you post a small example of your file, so it will be easier to help you parse it?

    Posting examples usually helps others to help you, whether it's example code, a file to be parsed, or an error message.

    Sorry if my advice was wrong.