jdlev has asked for the wisdom of the Perl Monks concerning the following question:

OOOOOMMMMMMMMM!!!!!

Hello Perl gurus... I've been struggling with reformatting the output of HTML::TableExtract. Here is ultimately what I want to do: 1) Go to a website 2) Scrape a table on the website based on its location on the page, because there are no freakin headers! 3) Put the information gathered into a MySQL database

So far, here is what I've gotten:

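use strict;
use warnings;
use lib qw( ..);
use HTML::TableExtract;
use LWP::Simple;
use Data::Dumper;

# depth 3 = nested inside three enclosing tables; count 0 = first table at that depth
my $te = HTML::TableExtract->new( depth => 3, count => 0, gridmap => 0 );
my $content = get("http://www.servpro.com/locator/lookup.asp?stname=Alabama&state=AL");
die "Could not fetch the page\n" unless defined $content;
$te->parse($content);

# Each table state is one matched table; each row is an array reference
foreach my $ts ($te->table_states)
{
    foreach my $row ($ts->rows)
    {
        print Dumper $row;
        # print Dumper $row if (scalar(@$row) == 2);
    }
}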

If you save the above and run it from your cmd prompt, it produces very messy output, but hey, at least it's something, right? What I want to do is structure the data into an array so I can send the information to a MySQL database.

I have no idea how to get HTML::TableExtract to return the data in an orderly fashion. I've tried to see what it is returning by simply printing one of the variables HTML::TableExtract fills in (like $row, rather than using the Dumper function), and it prints something like: "ARRAY(0x1b3243f)ARRAY(0x1b3432)"

First off, what are the strange things it is returning? I've never used dumper, but don't really like it. I would prefer to set a new string equal to each variable in the array. Then send that information to mysql.

One other quick question: can anyone tell me in plain English what
"foreach $ts ($te->table_states)" is telling the computer program to do?

Any help is greatly appreciated...hope everyone has a good weekend! :)

I need to go meditate after working on this stupid thing all day! OOOOOOOOOOOOOOOMMMMMMMMMMMMMMMMMMMMM

Replies are listed 'Best First'.
Re: PERL HTML::TableExtractor
by Limbic~Region (Chancellor) on Dec 20, 2008 at 02:08 UTC
    jdlev,
    Dumping a data structure is primarily used to help visualize your data or to serialize it. I would not use Data::Dumper to store things in a database unless I was trying to serialize a data structure. In other words, not if I intended for the individual parts to be searchable afterwards.

    It is no secret that I really do not like coding anything web related, so the following is probably the wrong way to do it, but it makes your intent a whole lot more clear:

    You can replace the call to Dumper with storing a record in the database. $rec->{owner} will contain the owner information and $rec->{website} will contain the url to the website, etc, etc.

    Cheers - L~R

Re: PERL HTML::TableExtractor
by wfsp (Abbot) on Dec 20, 2008 at 07:37 UTC
    I was getting some odd output from HTML::TableExtract. Closer inspection of the html revealed 29 open table tags and 10 closing table tags. H::TE is allowed to be confused by a mess like that.

    I tried with HTML::TreeBuilder, only looking for the cells we are interested in.

    (I saved the source to a file for testing)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $filename = q{html/monk.html};
    my $r = HTML::TreeBuilder->new;
    $r->parse_file($filename);

    # <td width="48%" valign="top">
    my @cells = $r->look_down(
        _tag   => q{td},
        width  => q{48%},
        valign => q{top},
    );

    my $i;
    for my $cell (@cells){
        my $bold = $cell->look_down(_tag => q{b});
        print $bold->as_text, qq{\n};

        for my $item ($cell->content_refs_list) {
            next if ref $$item;
            print $$item, qq{\n};
        }

        my $link = $cell->look_down(
            _tag => q{a},
        );
        print $link->attr(q{href}), qq{\n\n};
        last if $i++ > 2;
    }
    output (extract)
    SERVPRO® of Central Alabama
    Wilson, David & Christie
    Phone: (205)678-2224
    Fax: (205)678-2226
    http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2196

    SERVPRO® of South Alabama
    Johnson, Walter G.
    Phone: (251)661-9282
    Fax: (251)660-7539
    http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2212

    SERVPRO® of Northern Alabama
    Wilson, David & Christie
    Phone: (205)678-2224
    Fax: (205)678-2226
    http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2233

    SERVPRO® of Central Alabama II
    Wilson, David & Christie
    Phone: (205)678-2224
    Fax: (205)678-2226
    http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2226
    Update: Tidied up the output
      Thanks for the help! That worked great for printing to the cmd prompt...I have one last question, and then I'm going to try to get it to do the rest myself.

      Each of the above lines needs to be put into a database field. In other words, in your example you printed:

      my @cells = $r->look_down(
      _tag => q{td},
      width => q{48%},
      valign => q{top},
      );

      Output:
      SERVPRO® of Central Alabama II
      Wilson, David & Christie
      Phone: (205)678-2224
      Fax: (205)678-2226
      http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2226

      How would I pull out the following information into different variables such as:

      $location = SERVPRO® of Central Alabama II
      $phone = (205) 678-2224
      $fax = (205) 678-2226
      $website = http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2226

      Also, I had a few questions about the functions in your code and hope you could tell me what they do in plain English so that I can start implementing them? :)

      Here is the code you created:

      for my $cell (@cells){

      my $bold = $cell->look_down(_tag => q{b});
      print $bold->as_text, qq{\n};

      for my $item ($cell->content_refs_list) {
      next if ref $$item;
      print $$item, qq{\n};
      }

      My first question is on

      for my $item ($cell->content_refs_list)

      Here is what I understand. The "for" loop is creating a new value for $item for each item in the array ($cell->content_refs_list), correct? So what is $cell->content_refs_list creating, and how does it know to create a new line at each break in the data? In general, what does the "->" do, and what does "content_refs_list" refer to?

      Next you print $$item. Why use two $$ here?

      I think I understand the rest, so if you could explain and help me with the points above, I should be good to go! Thanks for the awesome help...OOOMMMM!!! :)

        HTML::TreeBuilder builds trees of HTML::Elements. The methods we'll be looking at come from there. Keep the docs handy. :-)

        The table cells we are interested in look like the following (tidied up):

        <td width="48%" valign="top">
          <b>SERVPRO<sup><small>&#174;</small></sup>of Northern Alabama</b>
          <br>
          Wilson, David & Christie
          <br>
          Phone: (205)678-2224
          <br>
          Fax: (205)678-2226
          <br>
          <a href='http://www.servpro.com/'>Visit their web site</a>
        </td>
        First we get an array of all those cells (an array of H::E objects)
        my @cells = $r->look_down( _tag => q{td}, width => q{48%}, valign => q{top}, );
        In scalar context $obj->look_down returns the first found, in list context it returns all of them.

        For each cell

        for my $cell (@cells){
        we first look down for the bold tag element and print out the text within it
        my $bold = $cell->look_down(_tag => q{b});
        print $bold->as_text, qq{\n};
        we then iterate over a list of the cell's contents
        for my $item ($cell->content_refs_list) {
            next if ref $$item;
            print $$item, qq{\n};
        }
        $obj->content_refs_list is another H::E method which, as you might guess, returns a list of references. Each reference is either a reference to an H::E object (i.e. another ref) or a reference to text. next if ref $$item; skips over other H::E objects (in this case the <b>, <br> and <a> tags) so what is left is a reference to text. $$item dereferences the reference.
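To see the `next if ref $$item` test in isolation, here is a minimal sketch in plain Perl with no HTML modules involved. The `My::Element` class and the sample strings are stand-ins made up for illustration; they only imitate the mix of text and element objects that content_refs_list hands back.

```perl
use strict;
use warnings;

# Stand-ins for a cell's children: two text nodes and one element.
# 'My::Element' is an invented class name, not part of HTML::Element.
my @content = (
    'Wilson, David & Christie',
    bless( { tag => 'br' }, 'My::Element' ),
    'Phone: (205)678-2224',
);

# Imitate content_refs_list: build one reference per child.
my @refs;
push @refs, \$_ for @content;

my @text;
for my $item (@refs) {
    next if ref $$item;     # $$item is an object, so skip it
    push @text, $$item;     # $$item is a plain string, so keep it
}

print "$_\n" for @text;
```

The single `$item` is the reference itself; the double `$$item` is the thing it points at, which is why the test and the print both use two sigils.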

        In fact this looks very similar to the example in the H::E docs. So go see. :-)

        Finally we want to look down for the anchor tag

        my $link = $cell->look_down( _tag => q{a}, );
        and print out the href attribute
        print $link->attr(q{href}), qq{\n\n};
        Rather than print out the results you could push them onto an array (say, @record) so that $record[0] would be the location, $record[1] the phone number etc..
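A rough sketch of that idea, going one step further and sorting the lines into named fields. The `parse_record` helper and its field names are hypothetical, and it assumes every record looks like the sample output earlier in the thread (location first, then owner, phone, fax and URL lines), which may not hold for every cell on the page.

```perl
use strict;
use warnings;

# The lines collected for one cell, as they were printed earlier.
my @record = (
    'SERVPRO® of Central Alabama II',
    'Wilson, David & Christie',
    'Phone: (205)678-2224',
    'Fax: (205)678-2226',
    'http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2226',
);

# Hypothetical helper: turn the lines of one record into a hash of fields.
sub parse_record {
    # Work on trimmed copies so stray whitespace from the HTML does not matter.
    my @lines = map { my $l = $_; $l =~ s/^\s+|\s+$//g; $l } @_;

    # Assumption: the first line is always the location name.
    my %rec = ( location => shift @lines );

    for my $line (@lines) {
        if    ( $line =~ /^Phone:\s*(.+)/ ) { $rec{phone}   = $1 }
        elsif ( $line =~ /^Fax:\s*(.+)/ )   { $rec{fax}     = $1 }
        elsif ( $line =~ m{^https?://} )    { $rec{website} = $line }
        else                                { $rec{owner}   = $line }
    }
    return \%rec;
}

my $rec = parse_record(@record);
print "$_ = $rec->{$_}\n" for sort keys %$rec;
```

Each value ($rec->{phone}, $rec->{fax} and so on) is then an ordinary string that can be handed to a database insert.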

        You can get the low down on the arrow -> in perlreftut and perlref. We use it here to call an object's method.
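As a small illustration of the two jobs the arrow does, here is a sketch using a toy `Counter` class invented for the example (it has nothing to do with HTML::Element): the arrow calls a method on a class or object, and it also reaches through a reference to an array or hash element.

```perl
use strict;
use warnings;

# A tiny class, made up purely for this example.
package Counter;
sub new  { my ($class) = @_; return bless { count => 0 }, $class }
sub bump { my ($self)  = @_; return ++$self->{count} }

package main;

my $c = Counter->new;       # arrow calls a class method
$c->bump;                   # arrow calls an object method
$c->bump;

my $row = [ 'a', 'b', 'c' ];
print $row->[1], "\n";      # arrow reaches into an array reference
print $c->{count}, "\n";    # and into a hash reference
```

So `$cell->look_down(...)` reads as "call the look_down method on the object held in $cell".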

        Good luck!

Re: PERL HTML::TableExtractor
by jethro (Monsignor) on Dec 20, 2008 at 05:21 UTC

    "ARRAY(0x1b3243f)" is what you get when you print out a reference aka pointer variable

    > perl -e ' @a=(1,2,3); $t= \@a; print $t,"\n";'
    ARRAY(0x806d3b4)

    You can't effectively print out complex data structures without using something like Data::Dumper
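For a simple case like one table row, though, dereferencing by hand is enough. A minimal sketch, assuming each row is an array reference of cell strings as in the original question; the sample rows and the `format_row` helper are made up for illustration, and the undef handling is there because empty cells may come back undefined.

```perl
use strict;
use warnings;

# Made-up rows standing in for what $ts->rows returns:
# each row is a reference to an array of cell strings.
my @rows = (
    [ 'SERVPRO of Central Alabama', 'Phone: (205)678-2224' ],
    [ 'SERVPRO of South Alabama',   undef ],
);

# Hypothetical helper: turn one row reference into a printable string.
sub format_row {
    my ($row) = @_;
    # @$row dereferences the array; undefined cells become empty strings.
    return join ' | ', map { defined $_ ? $_ : '' } @$row;
}

for my $row (@rows) {
    print "$row\n";                  # the reference itself: ARRAY(0x...)
    print format_row($row), "\n";    # the cells the reference points to
}
```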