marcoss has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm extracting info from a website using HTML::TreeBuilder::XPath. With the help of the monks I've been able to do this for other websites with almost no complications... until now. Basically, the foreach loop is not looping through the whole table to extract the info from each node; it only retrieves the first results. I have tried this many ways, but the result is always the same no matter which path I use for the node (even copying the whole XPath from the browser by right-clicking on it). This is the code; if you run it and look at the page source, you'll see what I mean.

#!/usr/bin/perl -w
use LWP::Simple;
use HTML::TreeBuilder::XPath;
use Data::Dumper;
use strict;

my $debug=1;
my $base='http://www.msccrociere.it/it_it';
my $url='/Partenza-Crociere/Trova-La-Tua-Crociera.aspx?Reg=CAR&DateF=201211&ddl=n&p=1&';

my $page = get($base.$url) or die $!;
my $p = HTML::TreeBuilder::XPath->new_from_content( $page );
#binmode( STDOUT, ':utf8');

my @trips= $p->findnodes( '//table[@id="tblFYCXML_Itin"]');
foreach my $trip (@trips){
    my $destination = $trip->findvalue('.//h2[@class="FYCmaneDestXML"]');
    my $shipname = $trip->findvalue('.//div[@class="cContentLeft"]/a/h3');
    print "$destination\n";
    print "$shipname\n";
}

I know I'm making a newbie mistake somewhere. Like I said, I've tried many different things before asking here. I hope you can give me a hand. Thanks a lot!!

Re: foreach my $question (@perlmonks){}
by tobyink (Canon) on Jun 19, 2012 at 09:32 UTC

    I'll give you a clue. Your problem is here:

    my @trips= $p->findnodes( '//table[@id="tblFYCXML_Itin"]');

    How many tables with id="tblFYCXML_Itin" do you expect the page to contain?

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

      Hi, on that page there's only 1 table with that ID... so I'm expecting the foreach loop to show me every h2, or every div[@class="something"]/a, or whatever I need to extract, such as departure dates, ship names, prices, duration, etc. Mmmmm... I still can't see where the mistake is... perhaps one more clue? Thanks!

        But that's not what the code says... Walk through it with me.

        my @trips= $p->findnodes( '//table[@id="tblFYCXML_Itin"]');
        # So there's exactly one table with that id.
        # So @trips contains now exactly one node, that node being that one table.
        # You still with me?
        # If not, try it:
        print "There is/are ", scalar(@trips), " nodes in \@trips.\n";

        Okay. And then:

        foreach my $trip (@trips){

        You see it? Look at that line again. See it now? Look again until you do.

        For each element of @trips (an array that, as we just established, has exactly one element), you want to do something. And you get a result that looks like the loop ran exactly one time. Hmm, boggles the mind, don't it :)

        If, at this point, you still really need another clue: try finding those nodes that you want to loop over, and loop over them, instead of trying to loop over something that you know only occurs once.
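A minimal sketch of that advice, under loud assumptions: the real page's markup is not shown in this thread, so the inline HTML below is a made-up stand-in in which each trip is one `tr` inside the table, and the destination and ship values are placeholders. The ship-name path uses `//h3` instead of `/a/h3` only to keep the sketch independent of how the parser nests the heading. If the real page groups trips differently, the `//tr` part of the path is what would need to change.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup (NOT the real page): one table, one row per trip.
my $html = q{
<table id="tblFYCXML_Itin">
  <tr><td>
    <h2 class="FYCmaneDestXML">Destination A</h2>
    <div class="cContentLeft"><a href="#"><h3>Ship A</h3></a></div>
  </td></tr>
  <tr><td>
    <h2 class="FYCmaneDestXML">Destination B</h2>
    <div class="cContentLeft"><a href="#"><h3>Ship B</h3></a></div>
  </td></tr>
</table>
};

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Select the repeating unit (assumed here to be each row), not the single table.
my @trips = $tree->findnodes('//table[@id="tblFYCXML_Itin"]//tr');

my @found;
foreach my $trip (@trips) {
    # The leading "." keeps each query relative to the current row.
    my $destination = $trip->findvalue('.//h2[@class="FYCmaneDestXML"]');
    my $shipname    = $trip->findvalue('.//div[@class="cContentLeft"]//h3');
    push @found, "$destination: $shipname";
}

print "$_\n" for @found;
$tree->delete;    # free the parse tree
```

The only real change from the original script is which nodes end up in @trips: the loop body stays the same, but it now runs once per row rather than once per table.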

        Well, there's your problem!

        This is essentially your code

        my @tables = (
            {
                'h2'    => [ .. ],
                'div/a' => [ .. ],
            },
        );
        for my $table (@tables) {
            # $table is a hash reference, so its keys are accessed with {}
            my $oneh2  = $table->{'h2'}->[0];
            my $onediv = $table->{'div/a'}->[0];
        }

        You're asking why this doesn't look for multiple h2s or divs -- do you see why it doesn't?
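For contrast, here is a sketch of the same toy structure with an inner loop that visits every entry instead of only index 0. The list contents are placeholder values invented for illustration, and the sketch assumes both lists have the same length.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parsed page: one table holding several values.
my @tables = (
    {
        'h2'    => [ 'Destination A', 'Destination B' ],
        'div/a' => [ 'Ship A',        'Ship B' ],
    },
);

my @pairs;
for my $table (@tables) {
    my $h2s  = $table->{'h2'};
    my $divs = $table->{'div/a'};

    # Inner loop: walk every index of the lists, not just the first one.
    for my $i ( 0 .. $#{$h2s} ) {
        push @pairs, "$h2s->[$i] / $divs->[$i]";
    }
}

print "$_\n" for @pairs;
```

The outer loop still runs once, because there is still one table; the repetition comes from iterating over what is inside it.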