in reply to PERL HTML::TableExtractor

I was getting some odd output from HTML::TableExtract. Closer inspection of the HTML revealed 29 opening table tags and only 10 closing table tags. H::TE can be forgiven for being confused by a mess like that.

I tried with HTML::TreeBuilder, only looking for the cells we are interested in.

(I saved the source to a file for testing)

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $filename = q{html/monk.html};
my $r = HTML::TreeBuilder->new;
$r->parse_file($filename);

# <td width="48%" valign="top">
my @cells = $r->look_down(
    _tag   => q{td},
    width  => q{48%},
    valign => q{top},
);

my $i;
for my $cell (@cells){
    my $bold = $cell->look_down(_tag => q{b});
    print $bold->as_text, qq{\n};

    for my $item ($cell->content_refs_list) {
        next if ref $$item;
        print $$item, qq{\n};
    }

    my $link = $cell->look_down(
        _tag => q{a},
    );
    print $link->attr(q{href}), qq{\n\n};
    last if $i++ > 2;
}
Output (extract):

SERVPRO® of Central Alabama
Wilson, David & Christie
Phone: (205)678-2224
Fax: (205)678-2226
http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2196

SERVPRO® of South Alabama
Johnson, Walter G.
Phone: (251)661-9282
Fax: (251)660-7539
http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2212

SERVPRO® of Northern Alabama
Wilson, David & Christie
Phone: (205)678-2224
Fax: (205)678-2226
http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2233

SERVPRO® of Central Alabama II
Wilson, David & Christie
Phone: (205)678-2224
Fax: (205)678-2226
http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2226

Update: Tidied up the output

Re^2: PERL HTML::TableExtractor
by jdlev (Scribe) on Dec 22, 2008 at 18:09 UTC
    Thanks for the help! That worked great for printing to the cmd prompt...I have one last question, and then I'm going to try to get it to do the rest myself.

    Each of the above lines needs to be put into a database field. In other words, in your example you used:

    my @cells = $r->look_down(
    _tag => q{td},
    width => q{48%},
    valign => q{top},
    );

    Output:
    SERVPRO® of Central Alabama II
    Wilson, David & Christie
    Phone: (205)678-2224
    Fax: (205)678-2226
    http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2226

    How would I pull out the following information into different variables such as:

    $location = SERVPRO® of Central Alabama II
    $phone = (205) 678-2224
    $fax = (205) 678-2226
    $website = http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2226

    Also, I had a few questions about the functions in your code, and I hope you can tell me what they do in plain English so that I can start implementing them. :)

    Here is the code you created:

    for my $cell (@cells){
        my $bold = $cell->look_down(_tag => q{b});
        print $bold->as_text, qq{\n};

        for my $item ($cell->content_refs_list) {
            next if ref $$item;
            print $$item, qq{\n};
        }
    }

    My first question is on

    for my $item ($cell->content_refs_list)

    Here is what I understand. The "for" loop assigns a new value to $item for each item in the list ($cell->content_refs_list), correct? So what is $cell->content_refs_list creating, and how does it know to start a new line at each break in the data? In general, what does the "->" do, and what does "content_refs_list" refer to?

    Next you print $$item. Why use two $$ here?

    I think I understand the rest, so if you could explain and help me with the points above, I should be good to go! Thanks for the awesome help...OOOMMMM!!! :)

      HTML::TreeBuilder builds trees of HTML::Element objects. The methods we'll be looking at come from HTML::Element, so keep its docs handy. :-)

      The table cells we are interested in look like the following (tidied up):

      <td width="48%" valign="top">
        <b>SERVPRO<sup><small>&#174;</small></sup>of Northern Alabama</b>
        <br>
        Wilson, David & Christie
        <br>
        Phone: (205)678-2224
        <br>
        Fax: (205)678-2226
        <br>
        <a href='http://www.servpro.com/'>Visit their web site</a>
      </td>

      First we get an array of all those cells (an array of H::E objects):

      my @cells = $r->look_down(
          _tag   => q{td},
          width  => q{48%},
          valign => q{top},
      );

      In scalar context $obj->look_down returns the first element found; in list context it returns all of them.
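
      If scalar versus list context is new to you, here is a minimal sketch of the idea in plain Perl. The find_matches sub is a made-up helper that only mimics the "first match vs. all matches" behaviour with wantarray; it is not how look_down is implemented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every match in list context,
# but only the first match in scalar context.
sub find_matches {
    my @found = grep { /Alabama/ } @_;
    return wantarray ? @found : $found[0];
}

my @names = ('Central Alabama', 'South Alabama', 'Georgia');

my $first = find_matches(@names);   # scalar context: first match only
my @all   = find_matches(@names);   # list context: every match

print "first: $first\n";
print "all:   @all\n";
```

      Which context you get depends only on what is on the left of the assignment: a scalar variable or a list/array.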

      For each cell

      for my $cell (@cells){
      we first look down for the bold tag element and print out the text within it:
      my $bold = $cell->look_down(_tag => q{b});
      print $bold->as_text, qq{\n};
      We then iterate over the cell's list of content items:
      for my $item ($cell->content_refs_list) {
          next if ref $$item;
          print $$item, qq{\n};
      }
      $obj->content_refs_list is another H::E method which, as you might guess, returns a list of references. Each reference is either a reference to an H::E object (i.e. another ref) or a reference to text. next if ref $$item; skips over other H::E objects (in this case the <b>, <br> and <a> tags) so what is left is a reference to text. $$item dereferences the reference.
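
      To see the double $$ in isolation, here is a small sketch that needs no HTML at all. It assumes a list shaped like the one described above (references to either plain text or an object); Fake::Element is a made-up stand-in for an H::E object, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for an HTML::Element object; any blessed
# reference behaves the same way as far as ref() is concerned.
my $fake_element = bless {}, 'Fake::Element';

# A content list mixing plain text and an object, then a list of
# references to each item (the shape described for content_refs_list).
my @content = ('Wilson, David & Christie', $fake_element, 'Phone: (205)678-2224');
my @refs    = map { \$_ } @content;

my @text;
for my $item (@refs) {
    # $item is a reference; $$item is the thing it points at.
    # If that thing is itself a reference (an object), skip it.
    next if ref $$item;
    push @text, $$item;
}

print "$_\n" for @text;
```

      Only the two strings survive the loop; the object is skipped because ref $$item is true for it.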

      In fact this looks very similar to the example in the H::E docs. So go see. :-)

      Finally we look down for the anchor tag

      my $link = $cell->look_down(
          _tag => q{a},
      );
      and print out the href attribute
      print $link->attr(q{href}), qq{\n\n};
      Rather than printing the results, you could push them onto an array (say, @record) so that $record[0] would be the location, $record[2] the phone number, and so on.
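
      As a rough sketch of that idea, assume one record has already been collected in the same order as the sample output in this thread (name, owner, phone, fax, link). The variable names below are only suggestions, and the hard-coded @record stands in for what the loop would push.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One record as the loop might collect it with push @record, ...
# (values copied from the sample output; hard-coded for illustration).
my @record = (
    'SERVPRO® of Central Alabama II',
    'Wilson, David & Christie',
    'Phone: (205)678-2224',
    'Fax: (205)678-2226',
    'http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2226',
);

# Unpack the array into named variables in one go.
my ($location, $owner, $phone, $fax, $website) = @record;

# Trim stray whitespace, then drop the "Phone:" / "Fax:" labels.
s/^\s+|\s+$//g for $location, $owner, $phone, $fax, $website;
s/^(?:Phone|Fax):\s*//  for $phone, $fax;

print "location = $location\n";
print "phone    = $phone\n";
print "fax      = $fax\n";
print "website  = $website\n";
```

      From here each variable can be handed to whatever database code you end up using. Reset @record at the top of each pass through the cell loop so records do not run together.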

      You can get the lowdown on the arrow (->) in perlreftut and perlref. We use it here to call an object's method.
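
      A short sketch of the arrow's two common jobs, calling a method and reaching through a reference. Toy::Cell is a made-up class for illustration only; its attr method just mimics the look of the call used above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical toy class, just to have something to call methods on.
package Toy::Cell;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

sub attr {
    my ($self, $name) = @_;
    return $self->{$name};    # arrow reaching into a hash reference
}

package main;

my $cell = Toy::Cell->new(href => 'http://www.servpro.com/');  # class method call
my $href = $cell->attr('href');                               # object method call

my $list  = [ 'location', 'phone', 'fax' ];
my $first = $list->[0];                                       # element via array reference

print "$href\n$first\n";
```

      In every case the thing on the left of the arrow is a class name or a reference, and the thing on the right says what to do with it.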

      Good luck!