jpeg has asked for the wisdom of the Perl Monks concerning the following question:

Monks, I'm having a blond moment with HTML::TableExtract.
The URL I'm parsing has anywhere from 1 to 6 tables I'm interested in, no headers, and there are tables at the same depth I'm not interested in. Therefore I'm using depth/count tuples to specify which tables I want to parse like so:
    for ($mycount = 1; $mycount <= 9; $mycount++) {
        my $te = new HTML::TableExtract ( depth => 1, count => $mycount );
        # more stuff
    }
Of course, creating 9 objects and parsing the same file 9 times bugs me. I looked at the docs more closely and saw
    reset_state()
        If you are using the same HTML::TableExtract object for multiple parses, call this between each parse to wipe the internal slate clean.
which leads me to believe it's possible to reuse the $te object. However, I don't see a way to specify depth and count for an already existing object. Has anyone done this?

Thanks,
John

--
jpg

Re: How to re-use HTML::TableExtract objects?
by tphyahoo (Vicar) on May 30, 2005 at 15:29 UTC
    Is the $te object a hashref? Many of the HTML::Parser-type objects seem to be. If so, I believe you could do something like
    $te->{depth} = $whatever;
    and then manipulate that how you like. However, I confess to being an object newbie. If I'm wrong, or this idea just wasted your time, you can slap me with a fish ;)
      The object is indeed a hashref, as the following snippet shows, and so your approach should work fine:
      use strict;
      use warnings;
      use Data::Dumper;
      use HTML::TableExtract;

      my $te = HTML::TableExtract->new( depth => 2, count => 2 );
      print Dumper ($te);

      Output:
      $VAR1 = bless( {
                       '_ts_sequential' => [],
                       'headers' => undef,
                       'br_translate' => 1,
                       'gridmap' => 1,
                       'strip_html_on_match' => 0,
                       'subtables' => undef,
                       'decode' => 1,
                       'keep_headers' => 0,
                       '_in_a_table' => 0,
                       'keep' => 0,
                       'debug' => 0,
                       '_tables' => {},
                       '_cdepth' => -1,
                       'elastic' => 1,
                       'count' => 2,
                       'depth' => 2,
                       'automap' => 1,
                       'keepall' => 0,
                       'error_handle' => \*::STDOUT,
                       'attribs' => undef,
                       'keep_html' => 0,
                       'chain' => undef,
                       '_hparser_xs_state' => \25467620,
                       'slice_columns' => 1,
                       '_counts' => {},
                       '_tablestack' => [],
                       '_table_mapback' => {}
                     }, 'HTML::TableExtract' );
      But I slap you anyway, just for fun :-)


      holli, /regexed monk/
      Bueno bueno good good good!
      I wouldn't have thought of poking the objects that way. Cool. Thanks to both you and holli.

      *tosses you a fish*

      --
      jpg
Re: How to re-use HTML::TableExtract objects?
by graff (Chancellor) on May 30, 2005 at 16:55 UTC
    I haven't used HTML::TableExtract myself, but my reading of the man page leads me to think that in a single parse/pass over the input it will extract all tables that match a given set of constraints (or specify no constraints and have it extract all tables, period). After that, you can use "count" and "depth" values as coordinates in order to fetch the content of specific tables from the data structure that it uses to store all the tables that were extracted.

    So if the tables you are interested in happen to share a common set of column headings, you can just use the "headers" parameter when creating a single "TableExtract" object, then use that one object repeatedly in a loop to fetch the contents for each of the tables that have that set of headers. Check out the "table_states()" method.
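
For the header-less case in the original question, the same single-pass idea might look roughly like the sketch below: constrain on depth alone, parse once, then pick tables by their count afterwards. The sample HTML and the %wanted selection are hypothetical, made up for illustration. The sketch assumes a release of HTML::TableExtract where tables() is available; older releases call the same method table_states(), so check the installed version. Counts start at 0 within each depth.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical input: one depth-0 wrapper table holding three depth-1 tables.
my $html = <<'HTML';
<table><tr><td>
  <table><tr><td>a1</td><td>a2</td></tr></table>
  <table><tr><td>b1</td><td>b2</td></tr></table>
  <table><tr><td>c1</td><td>c2</td></tr></table>
</td></tr></table>
HTML

# Hypothetical selection: the depth-1 counts we want to keep.
my %wanted = map { $_ => 1 } ( 0, 2 );

# Constrain on depth only, so a single parse captures every table at that depth.
my $te = HTML::TableExtract->new( depth => 1 );
$te->parse($html);
$te->eof;

# Walk the extracted tables once and keep the rows of the wanted ones.
my %rows_for;
for my $ts ( $te->tables ) {
    my ( $depth, $count ) = $ts->coords;
    next unless $wanted{$count};
    for my $row ( $ts->rows ) {
        push @{ $rows_for{$count} }, [@$row];
    }
}

# Print each kept table, one comma-separated row per line.
for my $count ( sort { $a <=> $b } keys %rows_for ) {
    print "depth 1, count $count:\n";
    print '  ', join( ',', map { defined $_ ? $_ : '' } @$_ ), "\n"
        for @{ $rows_for{$count} };
}
```

This parses the document once with one object, so there is no need to reset state or to change depth and count on an existing object.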