
Thanks for the help! That worked great for printing to the cmd prompt. I have one last question, and then I'm going to try to get it to do the rest myself.

Each of the above lines needs to go into its own database field. In other words, in your example you printed:

my @cells = $r->look_down(
    _tag   => q{td},
    width  => q{48%},
    valign => q{top},
);

Output:
SERVPRO® of Central Alabama II
Wilson, David & Christie
Phone: (205)678-2224
Fax: (205)678-2226
http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2226

How would I pull out the following information into different variables such as:

$location = SERVPRO® of Central Alabama II
$phone = (205) 678-2224
$fax = (205) 678-2226
$website = http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2226

Also, I have a few questions about the functions in your code, and I hope you can tell me what they do in plain English so that I can start implementing them. :)

Here is the code you created:

for my $cell (@cells) {

    my $bold = $cell->look_down(_tag => q{b});
    print $bold->as_text, qq{\n};

    for my $item ($cell->content_refs_list) {
        next if ref $$item;
        print $$item, qq{\n};
    }
}

My first question is on

for my $item ($cell->content_refs_list)

Here is what I understand. The "for" loop is creating a new value for $item for each item in the list returned by $cell->content_refs_list, correct? So what is $cell->content_refs_list creating, and how does it know to create a new line at each break in the data? In general, what does the "->" do, and what does "content_refs_list" refer to?

Next you print $$item. Why use two $$ here?

I think I understand the rest, so if you could explain and help me with the points above, I should be good to go! Thanks for the awesome help...OOOMMMM!!! :)

Re^3: PERL HTML::TableExtractor
by wfsp (Abbot) on Dec 23, 2008 at 06:56 UTC
    HTML::TreeBuilder builds trees of HTML::Elements. The methods we'll be looking at come from there. Keep the docs handy. :-)

    The table cells we are interested in look like the following (tidied up):

    <td width="48%" valign="top">
      <b>SERVPRO<sup><small>&#174;</small></sup>of Northern Alabama</b>
      <br> Wilson, David & Christie
      <br> Phone: (205)678-2224
      <br> Fax: (205)678-2226
      <br> <a href='http://www.servpro.com/'>Visit their web site</a>
    </td>
    First we get an array of all those cells (an array of HTML::Element objects, H::E for short):
    my @cells = $r->look_down(
        _tag   => q{td},
        width  => q{48%},
        valign => q{top},
    );
    In scalar context $obj->look_down returns the first match; in list context it returns all of them.
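
The scalar/list distinction can be tried out without any HTML at all. The following is only a sketch: find_matches is a made-up stand-in written for this illustration (it is not part of HTML::Element), and the strings in @cells are placeholders for real cell objects.

```perl
use strict;
use warnings;

# Hypothetical stand-in for look_down: returns every match in list
# context, but only the first match in scalar context.
sub find_matches {
    my ($pattern, @items) = @_;
    my @found = grep { /$pattern/ } @items;
    return wantarray ? @found : $found[0];
}

# Placeholder data: plain strings instead of HTML::Element objects.
my @cells = qw(td1 td2 th1 td3);

my @all   = find_matches(qr/^td/, @cells);   # list context: all matches
my $first = find_matches(qr/^td/, @cells);   # scalar context: first match

print "all:   @all\n";
print "first: $first\n";
```

Assigning to an array asks for a list, assigning to a scalar asks for one value, and the function decides what to hand back based on that.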

    For each cell

    for my $cell (@cells){
    we first look down for the bold tag element and print out the text within it
    my $bold = $cell->look_down(_tag => q{b});
    print $bold->as_text, qq{\n};
    we then iterate over the contents of the cell
    for my $item ($cell->content_refs_list) {
        next if ref $$item;
        print $$item, qq{\n};
    }
    $obj->content_refs_list is another H::E method which, as you might guess, returns a list of references, one for each item in the cell's content. Each reference points either to an H::E object (which is itself a reference) or to a plain string of text. next if ref $$item; skips over the H::E objects (in this case the <b>, <br> and <a> tags), so what is left is a reference to text. $$item dereferences the reference: the extra $ follows the reference held in $item to the string it points at. The text arrives as separate items because the <br> tags sit between the pieces of text in the tree; the line break in the output comes from the qq{\n} in the print.
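
The double $ is easier to see with plain Perl data. This is a minimal sketch under stated assumptions: @content is invented for the illustration, with hash references playing the role of child elements and plain strings playing the role of text. It imitates the shape of what content_refs_list returns; it does not use the module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one cell's content: hash references act
# as child elements, plain strings act as text.
my @content = (
    { tag => 'b' },
    'Wilson, David & Christie',
    { tag => 'br' },
    'Phone: (205)678-2224',
);

# Build a list of references, one per content item.
my @refs;
push @refs, \$_ for @content;

my @text;
for my $item (@refs) {
    # $item is a reference; $$item is the thing it points at.
    next if ref $$item;     # the thing is itself a reference: skip it
    push @text, $$item;     # the thing is a plain string: keep it
}

print "$_\n" for @text;
```

Only the two strings survive the loop, which is the same filtering the next if ref $$item; line performs on a real cell.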

    In fact this looks very similar to the example in the H::E docs. So go see. :-)

    Finally we want to look down for the anchor tag

    my $link = $cell->look_down( _tag => q{a}, );
    and print out the href attribute
    print $link->attr(q{href}), qq{\n\n};
    Rather than print out the results you could push them onto an array (say, @record) so that $record[0] would be the location, $record[1] the contact name, $record[2] the phone number, etc.
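
One possible way to go from such an array to named fields is sketched below. The values in @record are copied from the output quoted in the question (with the ® left out to keep the source ASCII); the %field hash, its key names and the patterns are assumptions made for this illustration, not something the page or the module dictates.

```perl
use strict;
use warnings;

# Assumed input: the lines collected for one cell, in the order
# they were printed.
my @record = (
    'SERVPRO of Central Alabama II',
    'Wilson, David & Christie',
    'Phone: (205)678-2224',
    'Fax: (205)678-2226',
    'http://www.servpro.com/franchises/enhanced_asp/default.asp?fn=2226',
);

my %field;
$field{location} = shift @record;    # the first line is the bold name

for my $line (@record) {
    $line =~ s/^\s+|\s+$//g;         # trim stray whitespace
    if    ($line =~ /^Phone:\s*(.+)/) { $field{phone}   = $1 }
    elsif ($line =~ /^Fax:\s*(.+)/)   { $field{fax}     = $1 }
    elsif ($line =~ m{^https?://})    { $field{website} = $line }
    elsif (length $line)              { $field{contact} = $line }
}

print "$_ = $field{$_}\n" for sort keys %field;
```

Matching on the "Phone:" and "Fax:" labels rather than on position means a cell with a missing fax line would still put the phone number in the right field.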

    You can get the lowdown on the arrow -> in perlreftut and perlref. We use it here to call an object's method.
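
As a rough illustration of the arrow's two jobs, here is a sketch with a tiny class. The Cell package and its text method are hypothetical, written only for this example; they have nothing to do with HTML::Element.

```perl
use strict;
use warnings;

package Cell;    # hypothetical class, for illustration only

sub new {
    my ($class, %args) = @_;
    my $self = { %args };            # an object here is a blessed hash reference
    return bless $self, $class;
}

sub text {
    my ($self) = @_;
    return $self->{text};            # -> used to reach into a hash reference
}

package main;

my $cell = Cell->new(text => 'Phone: (205)678-2224');

my $via_method = $cell->text;        # -> used to call a method on an object

my $list   = [ 'a', 'b' ];
my $second = $list->[1];             # -> used to index into an array reference

print "$via_method\n";
print "$second\n";
```

In every case the thing on the left of the arrow is a reference (or a class name), and the arrow says "follow it and then do this".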

    Good luck!