Hi,

This is my first attempt to use Perl so I'm sure my problem is something simple.

I am trying to scrape data from realestate.com.au, but the script hangs when I include one particular block of code. For the life of me I can't see why, because that block is very similar to the code above it, which runs perfectly. The script freezes when I use HTML::TokeParser to go to a new tag there, yet the same call works perfectly well in other parts.

My code is below. The block that is causing me trouble is the one between =pod and =cut. The first statement inside that block is enough to trigger the problem: if I move the =pod below that statement so that it runs, the script hangs.

If you look at this in depth you will see that the page used in the script has no auction details. That is fine, because I want to use this script over multiple pages. For this page I expected the if statement to evaluate to false and all to be well... at least in my thinking.

Thanks for any light you could shed on this.

An example of an auction page is this: http://www.realestate.com.au/cgi-bin/rsearch?a=o&id=106137557

#!/usr/bin/perl
use WWW::Mechanize;
use HTML::TokeParser;
use Switch;

my $mech = WWW::Mechanize->new( autocheck => 1 );

#set url
my $url = "http://www.realestate.com.au/cgi-bin/rsearch?a=o&id=106023887";
#$mech->get("http://search.cpan.org");
$mech->get($url);

#pass the stream to tokeparser
my $stream = HTML::TokeParser->new(\$mech->{content});

# go to first p tag
my $tag = $stream->get_tag("p");

# loop through p tags until we find classes
until ($tag->[1]{class} eq "officeFax") {
    switch ($tag->[1]{class}) {
        case "propertyID" {
            $propid = $stream->get_trimmed_text("/p");

            # now get data straight after this tag
            $tag = $stream->get_tag("h1");
            $address = $stream->get_trimmed_text("/h1");
            $tag = $stream->get_tag("strong");
            if ($tag->[1]{class} eq "price") {
                $price = $stream->get_trimmed_text("/strong");
            }
            $tag = $stream->get_tag("h2");
            $header = $stream->get_trimmed_text("/h2");
            $tag = $stream->get_tag("h2");

            # make sure it's correct part of source
            if ($tag->[1]{class} eq "propertySummary") {
                $summary = $stream->get_trimmed_text("/h2");
            }

            # Due to information not appearing all the time replicate tag and stream for status
            $tag2 = $tag;
            $stream2 = $stream;
            $tag2 = $stream2->get_tag("h3");

            # Check for under contract/offer etc
            if ($tag2->[1]{class} eq "highlighted") {
                $status = $stream2->get_trimmed_text("/h3");
            }

            # Do the same for auction details
            $tag3 = $tag;
            $stream3 = $stream;

=pod
            $tag3 = $stream3->get_tag("span");

            # Get "Price Authority" - at the moment seems to be only auction
            if ($tag3->[1]{class} eq "price authority") {
                $priceauth = $stream3->get_trimmed_text("/span");
            }
            $tag3 = $stream3->get_tag("span");

            # Get Auction time
            if ($tag3->[1]{class} eq "price auction") {
                $auction = $stream3->get_trimmed_text("/span");
            }
=cut

            # Loop down to description
            $tag = $stream->get_tag("div");
            until ($tag->[1]{class} eq "description") {
                $tag = $stream->get_tag("div");
            }
            $description = $stream->get_trimmed_text("/div");

            # Get Agent Name
            $tag = $stream->get_tag("div");
            until ($tag->[1]{id} eq "contactAgentDetails") {
                $tag = $stream->get_tag("div");
            }
            $tag = $stream->get_tag("p");
            $agent = $stream->get_trimmed_text("/p");
        }
        case "officePhone" {
            $officephone = $stream->get_trimmed_text("/p");
        }
        case "officeFax" {
            $officefax = $stream->get_trimmed_text("/p");
        }
    }

    # go to next p tag
    $tag = $stream->get_tag("p");
}

# Loop down to property summary
until ($tag->[1]{id} eq "propertySummary") {
    $tag = $stream->get_tag("div");
}
$tag = $stream->get_tag("dt");
$mycat = $stream->get_trimmed_text("/dt");

# Get property summary details
until ($mycat eq "Close to:") {
    switch ($mycat) {
        case "Category:" {
            $tag = $stream->get_tag("dd");
            $proptype = $stream->get_trimmed_text("/dd");
        }
        case "Bedrooms:" {
            $tag = $stream->get_tag("dd");
            $bed = $stream->get_trimmed_text("/dd");
        }
        case "Bathrooms:" {
            $tag = $stream->get_tag("dd");
            $bath = $stream->get_trimmed_text("/dd");
        }
        case "Land:" {
            $tag = $stream->get_tag("dd");
            $land = $stream->get_trimmed_text("/dd");
        }
        case "Carport:" {
            $tag = $stream->get_tag("dd");
            $carnumport = $stream->get_trimmed_text("/dd");
        }
        case "Garage:" {
            $tag = $stream->get_tag("dd");
            $carnumgar = $stream->get_trimmed_text("/dd");
        }
        case "Municipality:" {
            $tag = $stream->get_tag("dd");
            $municipality = $stream->get_trimmed_text("/dd");
        }
    }
    $tag = $stream->get_tag("dt");
    $mycat = $stream->get_trimmed_text("/dt");
}

print "$propid \n";
print "$status \n";
print $mech->title;
print "Address: $address \n";
print "Price: $price \n";
print "Type: $proptype \n";
print "Bedrooms: $bed \n";
print "Bathrooms: $bath \n";
if (length($carnumport)>0) {
    print "Carport: $carnumport \n";
}
if (length($carnumgar)>0) {
    print "Garage: $carnumgar \n";
}
print "$header \n";
print "$summary \n";
print "$description \n";
print "Agent Details: \n";
print "$agent \n";
print "$officephone \n";
print "$officefax \n";
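One thing that may be relevant to the hang: `$stream3 = $stream` copies a reference to the parser, not the parser itself, so a `get_tag` call through `$stream3` also moves `$stream` forward. The sketch below is only an illustration of that behaviour, not a confirmed diagnosis of the script above. The HTML string is made up for the example (it deliberately contains no span tag), and `$fresh` and the guarded `while` loop are hypothetical additions, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use Scalar::Util qw(refaddr);

# Made-up stand-in for the page: no <span> anywhere, one description div.
my $html = '<h2 class="propertySummary">Summary</h2>'
         . '<div class="description">A nice house</div>';

my $stream  = HTML::TokeParser->new(\$html);
my $stream3 = $stream;    # copies the reference, not the parser state

# Both variables refer to the same parser object.
print "same object\n" if refaddr($stream3) == refaddr($stream);

# Searching for a tag that is absent reads through to the end of the input.
my $tag3 = $stream3->get_tag("span");
print "span not found\n" unless defined $tag3;

# The original variable is at the end of the input as well,
# so get_tag has nothing left to return.
my $tag = $stream->get_tag("div");
print "div not found either\n" unless defined $tag;

# A loop that tests the return value of get_tag stops at the end of the
# input instead of repeating the same failed lookup.
my $fresh = HTML::TokeParser->new(\$html);
my $description;
while (my $t = $fresh->get_tag("div")) {
    next unless ($t->[1]{class} // '') eq "description";
    $description = $fresh->get_trimmed_text("/div");
    last;
}
print "description: $description\n" if defined $description;
```

If this is what happens on the real page, an `until` loop that compares `$tag->[1]{class}` without first checking that `get_tag` returned something would never see its condition become true once the input is exhausted.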

In reply to Perl script not running through by lordy
