lordy has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

This is my first attempt to use Perl so I'm sure my problem is something simple.

I am attempting to scrape data from realestate.com.au but am running into problems when I include a particular bit of code. For the life of me I can't understand why this is so as the code is very similar to code above it which runs perfectly. When I use HTML::TokeParser to go to a new tag, the script freezes, even though the same call works perfectly well in other parts.

My code is below; the block that is causing me trouble is just below the =pod line. Basically, the first statement beneath =pod causes the script to hang (I can see this by moving the =pod line below that statement).

If you look at this in depth, you will see that this page has no auction details, which is fine, since I want to use the script over multiple pages. For the page in the script, the if statement should evaluate to false and all would be well... at least in my thinking.

Thanks for any light you could shed on this.

An example of an auction page is this: http://www.realestate.com.au/cgi-bin/rsearch?a=o&id=106137557

#!/usr/bin/perl
use WWW::Mechanize;
use HTML::TokeParser;
use Switch;

my $mech = WWW::Mechanize->new( autocheck => 1 );

#set url
my $url = "http://www.realestate.com.au/cgi-bin/rsearch?a=o&id=106023887";
#$mech->get("http://search.cpan.org");
$mech->get($url);

#pass the stream to tokeparser
my $stream = HTML::TokeParser->new(\$mech->{content});

# go to first p tag
my $tag = $stream->get_tag("p");

# loop through p tags until we find classes
until ($tag->[1]{class} eq "officeFax") {
    switch ($tag->[1]{class}) {
        case "propertyID" {
            $propid = $stream->get_trimmed_text("/p");
            # now get data straight after this tag
            $tag = $stream->get_tag("h1");
            $address = $stream->get_trimmed_text("/h1");
            $tag = $stream->get_tag("strong");
            if ($tag->[1]{class} eq "price") {
                $price = $stream->get_trimmed_text("/strong");
            }
            $tag = $stream->get_tag("h2");
            $header = $stream->get_trimmed_text("/h2");
            $tag = $stream->get_tag("h2");
            # make sure it's correct part of source
            if ($tag->[1]{class} eq "propertySummary") {
                $summary = $stream->get_trimmed_text("/h2");
            }
            # Due to information not appearing all the time replicate tag and stream for status
            $tag2 = $tag;
            $stream2 = $stream;
            $tag2 = $stream2->get_tag("h3");
            # Check for under contract/offer etc
            if ($tag2->[1]{class} eq "highlighted") {
                $status = $stream2->get_trimmed_text("/h3");
            }
            # Do the same for auction details
            $tag3 = $tag;
            $stream3 = $stream;

=pod

            $tag3 = $stream3->get_tag("span");
            # Get "Price Authority" - at the moment seems to be only auction
            if ($tag3->[1]{class} eq "price authority") {
                $priceauth = $stream3->get_trimmed_text("/span");
            }
            $tag3 = $stream3->get_tag("span");
            # Get Auction time
            if ($tag3->[1]{class} eq "price auction") {
                $auction = $stream3->get_trimmed_text("/span");
            }

=cut

            # Loop down to description
            $tag = $stream->get_tag("div");
            until ($tag->[1]{class} eq "description") {
                $tag = $stream->get_tag("div");
            }
            $description = $stream->get_trimmed_text("/div");
            # Get Agent Name
            $tag = $stream->get_tag("div");
            until ($tag->[1]{id} eq "contactAgentDetails") {
                $tag = $stream->get_tag("div");
            }
            $tag = $stream->get_tag("p");
            $agent = $stream->get_trimmed_text("/p");
        }
        case "officePhone" {
            $officephone = $stream->get_trimmed_text("/p");
        }
        case "officeFax" {
            $officefax = $stream->get_trimmed_text("/p");
        }
    }
    # go to next p tag
    $tag = $stream->get_tag("p");
}

# Loop down to property summary
until ($tag->[1]{id} eq "propertySummary") {
    $tag = $stream->get_tag("div");
}
$tag = $stream->get_tag("dt");
$mycat = $stream->get_trimmed_text("/dt");

# Get property summary details
until ($mycat eq "Close to:") {
    switch ($mycat) {
        case "Category:" {
            $tag = $stream->get_tag("dd");
            $proptype = $stream->get_trimmed_text("/dd");
        }
        case "Bedrooms:" {
            $tag = $stream->get_tag("dd");
            $bed = $stream->get_trimmed_text("/dd");
        }
        case "Bathrooms:" {
            $tag = $stream->get_tag("dd");
            $bath = $stream->get_trimmed_text("/dd");
        }
        case "Land:" {
            $tag = $stream->get_tag("dd");
            $land = $stream->get_trimmed_text("/dd");
        }
        case "Carport:" {
            $tag = $stream->get_tag("dd");
            $carnumport = $stream->get_trimmed_text("/dd");
        }
        case "Garage:" {
            $tag = $stream->get_tag("dd");
            $carnumgar = $stream->get_trimmed_text("/dd");
        }
        case "Municipality:" {
            $tag = $stream->get_tag("dd");
            $municipality = $stream->get_trimmed_text("/dd");
        }
    }
    $tag = $stream->get_tag("dt");
    $mycat = $stream->get_trimmed_text("/dt");
}

print "$propid \n";
print "$status \n";
print $mech->title;
print "Address: $address \n";
print "Price: $price \n";
print "Type: $proptype \n";
print "Bedrooms: $bed \n";
print "Bathrooms: $bath \n";
if (length($carnumport)>0) {
    print "Carport: $carnumport \n";
}
if (length($carnumgar)>0) {
    print "Garage: $carnumgar \n";
}
print "$header \n";
print "$summary \n";
print "$description \n";
print "Agent Details: \n";
print "$agent \n";
print "$officephone \n";
print "$officefax \n";

Re: Perl script not running through
by 7stud (Deacon) on Nov 14, 2009 at 11:56 UTC

    Hi,

    "... when I include a particular bit of code. For the life of me I can't understand why this is so as the code is very similar to code above it which runs perfectly."

    One difference between this line of your code:

    $tag = $stream->get_tag("div");

    and the code right above it, e.g.:

    $tag3 = $stream3->get_tag("span");

    is that the first line modifies $tag, the control variable for the main loop:

    until ($tag->[1]{class} eq "officeFax")

    Are you aware of that?

    To make troubleshooting easier, you should start deleting code. Delete everything not related to your problem. Are all those case blocks necessary? You should also probably save a fairly simple, fixed page of HTML and work on that. You should be able to whittle your code down to about 20 lines to isolate the problem.
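
    A minimal sketch of such a reduced test case, assuming a small hand-written HTML fragment in place of the live page (the fragment and the find_div_text helper are made up for illustration). It checks two things: assigning $stream to another variable copies the reference, so both names drive the same parser, and get_tag returns undef once the document is exhausted, so an until loop that only tests a class attribute would never end.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use Scalar::Util qw(refaddr);

# Hand-written stand-in for the real page (an assumption, not the site's markup).
my $html = <<'HTML';
<p class="propertyID">106137557</p>
<div class="description">Sunny two-bedroom house</div>
<span class="footer">Contact the agent</span>
HTML

# Return the trimmed text of the next <div> with the given class,
# or undef if the parser runs out of tags first.
sub find_div_text {
    my ($parser, $class) = @_;
    while (defined(my $tag = $parser->get_tag("div"))) {
        my $found = $tag->[1]{class};
        return $parser->get_trimmed_text("/div")
            if defined $found && $found eq $class;
    }
    return;    # end of document reached: stop instead of looping forever
}

my $stream  = HTML::TokeParser->new(\$html);
my $stream3 = $stream;    # copies the reference, not the parser

my $same_object = refaddr($stream3) == refaddr($stream);
print "same parser object\n" if $same_object;

# Advancing via $stream3 also advances $stream, here past the description.
$stream3->get_tag("span");
my $after_span = find_div_text($stream, "description");
print defined $after_span
    ? "found: $after_span\n"
    : "description already skipped\n";

# A separate parser starts again from the top of the document.
my $fresh      = HTML::TokeParser->new(\$html);
my $from_fresh = find_div_text($fresh, "description");
print "fresh parser found: $from_fresh\n";
```

    If the shared parser has already moved past the description, the guarded loop returns undef rather than spinning, which makes the skipped element easy to spot.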

      Thanks, will do

Re: Perl script not running through
by gmargo (Hermit) on Nov 14, 2009 at 17:46 UTC

    This doesn't answer your question, but...

    I prefer HTML::TreeBuilder for parsing HTML. The HTML page is parsed once into a tree-like structure, and then you traverse the structure (up, down, or sideways) to find the content you seek. You don't need to process the file on the fly, token by token.

    I took the liberty of writing a version of your code using HTML::TreeBuilder to demonstrate the difference.
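
    As a rough illustration of the tree-based style (a minimal sketch, not the version mentioned above; the HTML fragment, its values, and the text_of helper are assumptions made up for this example), look_down finds an element by tag and attribute wherever it sits in the tree, and a missing element simply yields undef:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hand-written stand-in for the real page (an assumption, not the site's markup).
my $html = <<'HTML';
<html><body>
<p class="propertyID">106137557</p>
<h1>1 Example Street</h1>
<div class="description">Sunny two-bedroom house</div>
<p class="officePhone">Phone: 0000 0000</p>
</body></html>
HTML

# Parse the whole document once into a tree.
my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Return the trimmed text of the first element matching the criteria,
# or undef when no such element exists.
sub text_of {
    my ($root, %criteria) = @_;
    my $element = $root->look_down(%criteria);
    return defined $element ? $element->as_trimmed_text : undef;
}

my $propid      = text_of($tree, _tag => 'p',    class => 'propertyID');
my $description = text_of($tree, _tag => 'div',  class => 'description');
my $auction     = text_of($tree, _tag => 'span', class => 'price auction');

print "Property ID: $propid\n";
print "Description: $description\n";
print "Auction: ", (defined $auction ? $auction : "(none on this page)"), "\n";

$tree->delete;    # release the tree when finished
```

    Because the page is parsed up front, looking for an element that is absent, such as the auction span here, does not move any read position past the elements that follow it.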

      Hi,

      Thanks for that. I don't understand how all of it works on a first read, but it may be just the thing to make me switch to TreeBuilder.

      Appreciate it!

Re: Perl script not running through
by Anonymous Monk on Nov 14, 2009 at 10:08 UTC
    Without even looking past use Switch; I can tell it is because you're using Switch. Probably. :)
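
    For context, Switch is implemented as a source filter, which rewrites the program text before Perl compiles it and can fail in confusing ways. A minimal sketch of one way to do without it, assuming the dt/dd pairs have already been read into a list (the sample pairs and the %field_for and %property names are made up for illustration): a plain hash lookup stands in for the second switch block.

```perl
use strict;
use warnings;

# Map each <dt> label to the field it fills (labels taken from the question).
my %field_for = (
    'Category:'     => 'proptype',
    'Bedrooms:'     => 'bed',
    'Bathrooms:'    => 'bath',
    'Land:'         => 'land',
    'Carport:'      => 'carnumport',
    'Garage:'       => 'carnumgar',
    'Municipality:' => 'municipality',
);

# Hypothetical (label, value) pairs standing in for the page's dt/dd pairs.
my @pairs = (
    [ 'Category:', 'House' ],
    [ 'Bedrooms:', '3' ],
    [ 'Pool:',     'Yes' ],    # a label the script does not track
    [ 'Garage:',   '2' ],
);

my %property;
for my $pair (@pairs) {
    my ($label, $value) = @$pair;
    my $field = $field_for{$label} or next;    # skip unknown labels
    $property{$field} = $value;
}

print "$_: $property{$_}\n" for sort keys %property;
```

    Collecting the values in one hash also avoids the separate undeclared globals, so the script can run under use strict.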