Web scraping

libvenus has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!!

I have been assigned a task to scrape a website. I know I can do it using Mechanize, but I'm unable to parse the content below:


<div id="dealer_phone">9999999999</div>
<div id="dealer_managers">MR K.R.Shaw<br />Service Manager: Mr. R Leon<br />&nbsp;<br />&nbsp;</div></div>
<div id="hours">
<table width="240" height="140" cellspacing="0" cellpadding="2">
           <tr>
            <td colspan="3" class="td_cellheader">
                <span class="dots">::&nbsp;</span>
                        <span class="dots_cont">Hours of Operation</span>
                        <span class="dots">&nbsp;::</span>
                        </td>
               </tr>
               <tr>
            <td valign="top"><strong>&nbsp;SALES</strong></td>
            <td valign="top">Mon-Thu<br />Fri-Sat<br />Sun<br /></td>
                        <td valign="top"> 9:00AM-7:30PM&nbsp;<br /> 9:00AM-6:00PM&nbsp;<br /> 12:00PM-4:00PM&nbsp;<br></td>
              </tr>
</table>
</div>

Is there a way I could achieve this? TIA!!

Re: Web scraping by Corion (Patriarch) on Jul 14, 2010 at 16:29 UTC
I guess the first step would be telling us what data you want, and showing us what code you wrote, and also telling us how your code fails to extract the data you want. Also see Web::Scraper for a different approach to extracting data from HTML.
Re^2: Web scraping by libvenus (Sexton) on Jul 14, 2010 at 16:56 UTC
Here is the code. I'm reading zipcodes from a spreadsheet and scraping data related to each zipcode from a website.

use strict;
use warnings;
use Spreadsheet::ParseExcel;
use WWW::Mechanize;
use HTML::Form;

#Variable declaration
my ( $zicodes_file , $website , $xls_parser , $xls_workbook ,
     $xls_worksheet, $row_min , $row_max , $col_min , $col_max ,
     @zip_codes , $mech , @zip_inputs , $input , @dealer_links );

#Variable Initialization
$zicodes_file = shift or die "1st Cmd Param(zipcodes spreadsheet) Missing..Exiting!!";
$website      = shift or die "2nd Cmd Param(honda dealer website) Missing..Exiting!!";

#STEP - 1 - Read in zipcodes from Zipcodes Spreadsheet
$xls_parser   = Spreadsheet::ParseExcel->new();
$xls_workbook = $xls_parser->parse( $zicodes_file );
die $xls_parser->error(), ".\n" if ( !defined $xls_workbook );

##ZipCodes are in 2nd worksheet
$xls_worksheet = $xls_workbook->worksheet(1);
( $row_min, $row_max ) = $xls_worksheet->row_range();
( $col_min, $col_max ) = $xls_worksheet->col_range();

for my $row ( 1 .. $row_max ) {
    my $col  = 0;    #Zipcodes are in first column
    my $cell = $xls_worksheet->get_cell( $row, $col );
    next unless $cell;
    push @zip_codes,$cell->value();
}

#STEP - 2 - Read in related data for zipcodes using the website
$mech = WWW::Mechanize->new();
$mech->get( $website );
die "Could not fetch $website ", $mech->status," \n" if ( !$mech->success );

$mech->form_name( 'searchdealer' );
@zip_inputs = $mech->find_all_inputs(
    type => 'text',
    id   => 'searchform_txt_zip',
);

#testing with only one zip code
$input = $zip_inputs[0];
$input->value( $zip_codes[ 0 ] );
$mech->submit;

@dealer_links = $mech->find_all_links(
    url_regex => qr/results.+?dealer\=\d+$/i,
);
$mech->get( $dealer_links[0]->[0] );
print $mech->content;

I don't see any method in Mechanize that can help me parse the data in div html tags.
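
For illustration only, here is a minimal sketch of how the Web::Scraper module suggested above could pull the div contents out of the markup posted in the question. The HTML string is a trimmed copy of that markup and is assumed to be representative of the real dealer page; the variable names ($dealer, $data) and the result keys (phone, managers, hours) are made up for this example and appear nowhere in the thread.

```perl
use strict;
use warnings;
use Web::Scraper;

# Trimmed copy of the markup from the original post (assumed to be
# representative of what the real dealer page returns).
my $html = <<'HTML';
<div id="dealer_phone">9999999999</div>
<div id="dealer_managers">MR K.R.Shaw<br />Service Manager: Mr. R Leon</div>
<div id="hours">
<table>
  <tr>
    <td><strong>SALES</strong></td>
    <td>Mon-Thu<br />Fri-Sat<br />Sun<br /></td>
    <td>9:00AM-7:30PM<br />9:00AM-6:00PM<br />12:00PM-4:00PM<br /></td>
  </tr>
</table>
</div>
HTML

# Each "process" line maps a CSS selector to a key in the result hash.
# A key ending in [] collects every match into an array reference.
my $dealer = scraper {
    process '#dealer_phone',    phone     => 'TEXT';
    process '#dealer_managers', managers  => 'TEXT';
    process '#hours td',        'hours[]' => 'TEXT';
};

# scrape() accepts a plain string of HTML and returns a hash reference.
my $data = $dealer->scrape($html);

print "Phone:    $data->{phone}\n";
print "Managers: $data->{managers}\n";
print "Hours:    $_\n" for @{ $data->{hours} || [] };
```

In the script above, the same scraper could presumably be fed the fetched page with $dealer->scrape( $mech->content ) in place of the final print. Note that 'TEXT' flattens each element to its text, so the pieces on either side of a <br /> are likely to be run together; if the individual lines matter, some further splitting of the cell contents would be needed.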