davido has asked for the wisdom of the Perl Monks concerning the following question:

I'm attempting to use HTML::TableExtract to grab the contents of a table of data from an HTML source. I'm using the 'headers' method of finding the appropriate table within a nest of several tables. However, I've been unsuccessful so far in getting the following test script to produce any output at all. This is troubling, since the snippet is taken almost verbatim from the Synopsis provided in the POD for the module in question.

use strict;
use warnings;
use LWP::Simple;
use HTML::TableExtract;

#my $page = get( 'http://www.garmin.com/support/download.jsp');
my $raw_html = do {
    open my $in, '<', 'garmin.htm' or die "Can't open infile: $!\n";
    local $/ = undef;
    <$in>;
};

my $te = new HTML::TableExtract(
    headers => [ "Product Name", "Software Version", "Compatible with Versions", "Date" ]
);
$te->parse($raw_html);

# Examine all matching tables
foreach my $ts ( $te->table_states ) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ( $ts->rows ) {
        print join( ',', @$row ), "\n";
    }
}

The table I'm trying to grab is found at http://www.garmin.com/support/download.jsp.

This is entirely for personal use, and not all that important a script. I already have a working version that uses regexes to pull the appropriate data and notify me if there's been an update to one of the particular devices I'm interested in, and even without that, Garmin has an email notification system in place. But I wanted to see if I could rewrite it using a more robust parser.

Any suggestions on where my snippet is failing to enable the module to find the table I'm searching for would be appreciated.


Dave

Replies are listed 'Best First'.
Re: Using HTML::TableExtract
by sacked (Hermit) on Jun 18, 2004 at 16:15 UTC
    The original html output has newlines and multiple spaces in the table header names, so they don't match the headers you specified. Adding the following after retrieving the html corrected the problem:
    $raw_html =~ tr/\n//d; $raw_html =~ tr/ //s;
    The output is now:
    Table (2,4):
    eMap, 2.90, All, August 7, 2003
    eTrex, 2.14, All, June 14, 2002
    ...

    --sacked
      Your tr statements can be combined:
      $raw_html =~ tr/ \n/ /ds;

      The power of tr///
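
The whitespace clean-up suggested above can be tried in isolation. This is a minimal sketch using only core Perl; the HTML fragment is a hypothetical stand-in for the real page source, not a copy of it.

```perl
use strict;
use warnings;

# Hypothetical fragment: header text wrapped across lines, as described above.
my $raw_html = "<th>Product\n        Name</th><th>Software\n        Version</th>";

# Work on a copy so the original string stays available for comparison.
( my $clean = $raw_html ) =~ tr/\n//d;    # delete every newline
$clean =~ tr/ //s;                         # squeeze runs of spaces to one

print "$clean\n";
```

Because the newline is deleted rather than replaced, a header written as "Product\nName" with no spaces around the break would come out as "ProductName". Using tr/\n/ / in the first step replaces the newline with a space instead and avoids that case.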
Re: Using HTML::TableExtract
by mojotoad (Monsignor) on Jun 18, 2004 at 18:26 UTC
    As others have pointed out, the newlines in the original source header strings are the culprit.

    What is generally missed about header-based extraction with HTML::TableExtract, however, is that the strings that you use to define the headers will eventually be turned into case-insensitive regular expressions.

    So change the following part:

    my $te = new HTML::TableExtract( headers => ["Product Name", "Software Version", "Compatible with Versions", "Date" ] );
    to this (notice the single quotes; otherwise you'll have to escape your backslashes):
    my $te = new HTML::TableExtract( headers => ['Product\s+Name', 'Software\s+Version', 'Compatible\s+with\s+Versions', 'Date' ] );
    ...and things will work as you expect. Also note that rather than strings, you can pass pre-compiled regexps from qr//, like so:
    my $te = new HTML::TableExtract( headers => [qr/Product\s+Name/, qr/Software\s+Version/, qr/Compatible\s+with\s+Versions/, 'Date' ] );

    Cheers,
    Matt
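
To see why the \s+ form helps, the matching can be reproduced without the module. This is a small sketch in core Perl that treats each header string as a case-insensitive pattern, as described above; the cell text is a hypothetical example of a header wrapped across lines.

```perl
use strict;
use warnings;

# Hypothetical header cell text, broken across lines as in the page source.
my $cell = "Product\n            Name";

# Two candidate header strings; both are applied as case-insensitive patterns.
my %patterns = (
    literal  => 'Product Name',      # requires exactly one space
    flexible => 'Product\s+Name',    # accepts any run of whitespace
);

my %matched;
for my $name ( sort keys %patterns ) {
    $matched{$name} = $cell =~ /$patterns{$name}/i ? 1 : 0;
    printf "%-8s %s\n", $name, $matched{$name} ? 'matches' : 'does not match';
}

# A pre-compiled pattern from qr// behaves the same way.
my $compiled_ok = $cell =~ qr/product\s+name/i ? 1 : 0;
```

Only the flexible pattern matches here, since \s+ covers the newline and the indentation that follow "Product".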

      Thanks for all the answers, everyone. I had it in the back of my mind that the problem might be related to embedded newlines, but I tried embedding my own in the header search strings and just didn't get the combination quite right. The tr/// suggestion was helpful.

      But I particularly like the fact that I can pass a regexp in. As I thought the issue over I actually thought to myself, "I wish I could just pass in a regexp." Voilà, I can. ;)

      Thanks again.


      Dave

Re: Using HTML::TableExtract
by jZed (Prior) on Jun 18, 2004 at 16:06 UTC
    Maybe it's the carriage returns inside the field names of the table (Product \n Name). Try munging the names into a form you can find. If that doesn't work, count the tables and use the count param for TableExtract.
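
Selecting a table by position instead of by headers might look like the sketch below. It assumes HTML::TableExtract is installed and uses its depth and count constructor parameters, both zero-based; the inline HTML is a hypothetical stand-in for the real page, and the tables accessor is used here on the assumption of a 2.x release of the module (the snippet in the question uses table_states).

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical page: two sibling tables at the top level (depth 0).
my $html = <<'HTML';
<html><body>
<table><tr><td>nav</td><td>links</td></tr></table>
<table>
  <tr><th>Product Name</th><th>Date</th></tr>
  <tr><td>eMap</td><td>August 7, 2003</td></tr>
</table>
</body></html>
HTML

# (depth 0, count 1) targets the second top-level table.
my $te = HTML::TableExtract->new( depth => 0, count => 1 );
$te->parse($html);
$te->eof;    # flush anything the underlying HTML::Parser is still buffering

my @rows;
foreach my $ts ( $te->tables ) {
    print "Table (", join( ',', $ts->coords ), "):\n";
    foreach my $row ( $ts->rows ) {
        push @rows, $row;
        # Empty cells can come back undefined, so guard before joining.
        print join( ' | ', map { defined $_ ? $_ : '' } @$row ), "\n";
    }
}
```

Position-based selection is brittle if the page layout changes, which is one reason the header-based approach discussed in the other replies is usually preferable when the headers can be matched.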