Perl script not running through

lordy has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

This is my first attempt to use Perl so I'm sure my problem is something simple.

I am attempting to scrape data from realestate.com.au, but I run into problems when I include one particular bit of code. I can't understand why, as that code is very similar to the code above it, which runs perfectly. When I use HTML::TokeParser to go to a new tag at that point, the script freezes, yet the same call works perfectly well in other parts of the script.

My code is below; the block that is causing me trouble is just below the =pod line. The first statement beneath the =pod is the culprit: if I move the =pod below that statement so that it runs, the script hangs.

If you look at this in depth you will see that this particular page has no auction details. That is fine, because I want to use the script over multiple pages. For the page in the script, the if statement should evaluate to false and all would be well... at least in my thinking.

Thanks for any light you could shed on this.

An example of an auction page is this: http://www.realestate.com.au/cgi-bin/rsearch?a=o&id=106137557

#!/usr/bin/perl
use WWW::Mechanize;
use HTML::TokeParser;
use Switch;

my $mech = WWW::Mechanize->new( autocheck => 1 );

#set url
my $url = "http://www.realestate.com.au/cgi-bin/rsearch?a=o&id=106023887";

#$mech->get("http://search.cpan.org");
$mech->get($url);

#pass the stream to tokeparser
my $stream = HTML::TokeParser->new(\$mech->{content});

# go to first p tag
my $tag = $stream->get_tag("p");

# loop through p tags until we find classes
until ($tag->[1]{class} eq "officeFax") 
    {
    
    switch ($tag->[1]{class}) 
        {
        case "propertyID"        
            {
                $propid = $stream->get_trimmed_text("/p");
                
                # now get data straight after this tag
                $tag = $stream->get_tag("h1");
                $address = $stream->get_trimmed_text("/h1");
                $tag = $stream->get_tag("strong");
                    if ($tag->[1]{class} eq "price") 
                        {
                            $price = $stream->get_trimmed_text("/strong");
                        }
                $tag = $stream->get_tag("h2");
                    $header = $stream->get_trimmed_text("/h2");
                $tag = $stream->get_tag("h2");
                    # make sure it's correct part of source
                    if ($tag->[1]{class} eq "propertySummary")
                        {
                            $summary = $stream->get_trimmed_text("/h2");
                        }
                # Due to information not appearing all the time replicate tag and stream for status
                $tag2 = $tag;
                $stream2 = $stream;
                $tag2 = $stream2->get_tag("h3");
                    # Check for under contract/offer etc
                    if ($tag2->[1]{class} eq "highlighted")
                        {
                            $status = $stream2->get_trimmed_text("/h3");
                        }

                # Do the same for auction details
                $tag3 = $tag;
                $stream3 = $stream;
               
=pod
                $tag3 = $stream3->get_tag("span");

                    # Get "Price Authority" - at the moment seems to be only auction
                    if ($tag3->[1]{class} eq "price authority")
                        {
                            $priceauth = $stream3->get_trimmed_text("/span");
                        }
                $tag3 = $stream3->get_tag("span");
                    # Get Auction time
                    if ($tag3->[1]{class} eq "price auction")
                        {
                            $auction = $stream3->get_trimmed_text("/span");
                        }
=cut
                
                # Loop down to description
                $tag = $stream->get_tag("div");
                until ($tag->[1]{class} eq "description") 
                        {
                            $tag = $stream->get_tag("div");
                        }
                $description = $stream->get_trimmed_text("/div");
                # Get Agent Name
                $tag = $stream->get_tag("div");
                until ($tag->[1]{id} eq "contactAgentDetails") 
                        {
                            $tag = $stream->get_tag("div");
                        }
                $tag = $stream->get_tag("p");
                $agent = $stream->get_trimmed_text("/p");
            }
        case "officePhone"
            {
                $officephone = $stream->get_trimmed_text("/p");
            }
        case "officeFax"
            {
                $officefax  = $stream->get_trimmed_text("/p");
            }
       }  
    # go to next p tag
    $tag = $stream->get_tag("p");
    }

# Loop down to property summary
until ($tag->[1]{id} eq "propertySummary") 
        {
            $tag = $stream->get_tag("div");                        
        }
        
$tag = $stream->get_tag("dt");
$mycat = $stream->get_trimmed_text("/dt");
# Get property summary details
until ($mycat eq "Close to:")
    {
        switch ($mycat)
            {
                case "Category:"
                    {
                        $tag = $stream->get_tag("dd");
                        $proptype = $stream->get_trimmed_text("/dd");
                    }
                case "Bedrooms:"
                    {
                        $tag = $stream->get_tag("dd");
                        $bed = $stream->get_trimmed_text("/dd");
                    }
                case "Bathrooms:"
                    {
                        $tag = $stream->get_tag("dd");
                        $bath = $stream->get_trimmed_text("/dd");
                    } 
                case "Land:"
                    {
                        $tag = $stream->get_tag("dd");
                        $land = $stream->get_trimmed_text("/dd");
                    } 
                case "Carport:"
                    {
                        $tag = $stream->get_tag("dd");
                        $carnumport = $stream->get_trimmed_text("/dd");
                    }
                case "Garage:"
                    {
                        $tag = $stream->get_tag("dd");
                        $carnumgar = $stream->get_trimmed_text("/dd");
                    }
                case "Municipality:"
                    {
                        $tag = $stream->get_tag("dd");
                        $municipality = $stream->get_trimmed_text("/dd");
                    }
            }
        $tag = $stream->get_tag("dt");
        $mycat = $stream->get_trimmed_text("/dt");
    }
    
print "$propid \n";
print "$status \n";
print $mech->title;    
print "Address: $address \n";
print "Price: $price \n";
print "Type: $proptype \n";
print "Bedrooms: $bed \n";
print "Bathrooms: $bath \n";
if (length($carnumport)>0)
    {
        print "Carport: $carnumport \n";
    }
if (length($carnumgar)>0)
    {
        print "Garage: $carnumgar \n";
    }
print "$header \n";
print "$summary \n";
print "$description \n";
print "Agent Details: \n";
print "$agent \n";
print "$officephone \n";
print "$officefax \n";


Re: Perl script not running through by 7stud (Deacon) on Nov 14, 2009 at 11:56 UTC
lordy wrote: "... when I include a particular bit of code. For the life of me I can't understand why this is so as the code is very similar to code above it which runs perfectly."

One difference between your code:
`$tag = $stream->get_tag("div");`
and the code right above it, e.g.:
`$tag3 = $stream3->get_tag("span");`
is that your code modifies the control variable for the main loop:
`until ($tag->[1]{class} eq "officeFax")`
Are you aware of that?

To make troubleshooting easier, you should start deleting code. Delete everything not related to your problem. Are all those case blocks necessary? You should also probably get a fixed page of HTML that is fairly simple, and work on that. You should be able to whittle your code down to about 20 lines to isolate the problem.
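
A related detail can be checked in isolation. In Perl, `$stream3 = $stream` copies the reference, not the parser, so both names point at the same object and a `get_tag` call through one also advances the other. `get_tag` returns undef once the document has no more tags, so an `until` loop waiting for a particular class may never finish if that tag has already been passed. The sketch below is only an illustration: `TinyStream` and its `next_tag` method are hypothetical stand-ins (they are not part of HTML::TokeParser), chosen so the example runs with core Perl alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# TinyStream is a hypothetical stand-in for a token parser: next_tag()
# hands out tag names in order and returns undef once they run out.
package TinyStream;
sub new      { my ($class, @tags) = @_; return bless { tags => [@tags] }, $class }
sub next_tag { my ($self) = @_; return shift @{ $self->{tags} } }

package main;

my $stream  = TinyStream->new(qw(h3 span div p));
my $stream3 = $stream;            # copies the reference, not the object

my $first = $stream3->next_tag;   # consumes "h3" from the shared object
my $next  = $stream->next_tag;    # so $stream now yields "span", not "h3"
print "first=$first next=$next\n";

# Searching for a tag that is no longer ahead: stop when the stream
# returns undef instead of looping forever.
my $found = 0;
while (defined(my $tag = $stream->next_tag)) {
    if ($tag eq 'h3') { $found = 1; last }
}
print $found ? "found h3\n" : "ran out of tags without finding h3\n";
```

If the same pattern applies to the real script, testing `defined $tag` inside each `until` loop would turn a silent hang into a visible "not found" case.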
Re^2: Perl script not running through by lordy (Initiate) on Nov 14, 2009 at 21:35 UTC
Thanks, will do.
Re: Perl script not running through by gmargo (Hermit) on Nov 14, 2009 at 17:46 UTC
This doesn't answer your question, but... I prefer HTML::TreeBuilder for parsing HTML. The HTML page is parsed once into a tree-like structure, and then you traverse the structure (up, down, or sideways) to find the content you seek. You don't need to process the file on the fly, token by token.

I took the liberty of writing a version of your code using HTML::TreeBuilder to demonstrate the difference.
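
For readers who have not used it, here is a minimal sketch of the tree-based approach described above. It is not gmargo's posted code. The HTML snippet is made up for illustration, and `text_of` is a hypothetical helper; only `new_from_content`, `look_down`, `as_trimmed_text` and `delete` come from HTML::TreeBuilder and HTML::Element.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up HTML for illustration; the real page layout is not reproduced here.
my $html = <<'HTML';
<html><body>
<p class="propertyID">Property No. 1</p>
<h1>1 Example Street</h1>
<div class="description">A made-up description.</div>
</body></html>
HTML

# Parse the whole document once into a tree.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Hypothetical helper: trimmed text of the first element matching the
# criteria, or undef when the page has no such element.
sub text_of {
    my ($root, @criteria) = @_;
    my $element = $root->look_down(@criteria);
    return defined $element ? $element->as_trimmed_text : undef;
}

my $propid  = text_of($tree, _tag => 'p',    class => 'propertyID');
my $address = text_of($tree, _tag => 'h1');
my $auction = text_of($tree, _tag => 'span', class => 'price auction');

print "ID: $propid\n";
print "Address: $address\n";
print "Auction: ", (defined $auction ? $auction : "(not listed)"), "\n";

# Free the tree explicitly when done.
$tree->delete;
```

Because each lookup searches the whole tree, a missing element such as the auction span simply comes back as undef and does not move a shared read position past the elements that follow it.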
Re^2: Perl script not running through by lordy (Initiate) on Nov 14, 2009 at 21:33 UTC
Hi, thanks for that. I don't get how all of it works from the first read, but it may be just the thing to have me switch to TreeBuilder. Appreciate it!
Re: Perl script not running through by Anonymous Monk on Nov 14, 2009 at 10:08 UTC
Without even looking past `use Switch;` I can tell it is because you're using Switch. Probably. :)
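
Whether or not Switch is the cause here, it is implemented as a source filter, which rewrites the code before perl compiles it, so ruling it out is cheap. A minimal sketch of one alternative, assuming the `case` blocks only store text under a name: a plain hash lookup. The `@paragraphs` data and the `%field_for` and `%record` names are made up for illustration and stand in for the (class, text) pairs the script reads from the page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up (class, text) pairs standing in for what the parser would return.
my @paragraphs = (
    [ propertyID  => 'Property No. 1' ],
    [ somethingElse => 'ignored' ],
    [ officePhone => 'Phone: (example)' ],
    [ officeFax   => 'Fax: (example)' ],
);

# Map each class of interest to the field it should fill.
my %field_for = (
    propertyID  => 'propid',
    officePhone => 'officephone',
    officeFax   => 'officefax',
);

my %record;
for my $paragraph (@paragraphs) {
    my ($class, $text) = @$paragraph;
    my $field = $field_for{$class} or next;   # skip classes we don't care about
    $record{$field} = $text;
}

print "$_: $record{$_}\n" for sort keys %record;
```

A chain of `if`/`elsif` tests would work just as well for cases that need more than a single assignment.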