Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I am trying to extract the information I need from an HTML table using the code below.

Can anyone suggest a better solution?
The script and the HTML are given below.

#!/usr/bin/perl
my $str = "";
while (<DATA>) {
    chomp($_);
    $str .= "$_";
}
$str =~ /<td\s+valign\=\"top\"\s+width\=\"50%\">(.*)\s+<\/table>/;
$req = $1;
my @data = split(/<\/tr>/, $req);

$data[1] =~ s/<td\s+valign=\"top\"\s+nowrap><b>Address\s+:<\/b><\/td>//g;
$data[1] =~ /<td>(.*?)<\/td>/;
print "Address: $1\n";

$data[2] =~ s/<td\s+valign\=\"top\"\s+nowrap><b>Level\s+of\s+Office\s+:<\/b><\/td>//g;
$data[2] =~ /<td>(.*?)<\/td>/;
print "Level of Office: $1\n";

$data[3] =~ s/<td\s+valign\=\"top\"\s+nowrap><b>Phone\s+No\s+:<\/b><\/td>//g;
$data[3] =~ /<td>(.*?)<\/td>/;
print "Phone No: $1\n";

$data[4] =~ s/<td\s+valign\=\"top\"\s+nowrap><b>Website\s+:<\/b><\/td>//g;
$data[4] =~ /<td><a\s+href=[\"](.*)[\"]\s+target\=\"\_blank\"\s+class\=\"pglink\">(.*?)(<\/a>)<\/td>/;
print "Website: $2\n";
__DATA__
<table><tr>
  <td valign="top" width="50%">
    <table cellspacing="3" cellpadding="2" border="0" width="98%" align="center">
      <!-- <tr>
        <td valign="top" nowrap width="30%"><b>Company Name :</b></td>
        <td width="75%">Abu Dhabi Commercial Bank (ADCB)</td>
      </tr>-->
      <tr>
        <td valign="top" nowrap><b>Address :</b></td>
        <td>Rehmat Manzil, 75, Veer Nariman Road, Churchgate </td>
      </tr>
      <tr>
        <td valign="top" nowrap><b>Level of Office :</b></td>
        <td>Head Office</td>
      </tr>
      <tr>
        <td valign="top" nowrap><b>Phone No :</b></td>
        <td>(22) 39534100</td>
      </tr>
      <tr>
        <td valign="top" nowrap><b>Website :</b></td>
        <td><a href="http://www.adcbindia.com" target="_blank" class="pglink">www.adcbindia.com</a></td>
      </tr>
      <tr>
        <td valign="top" nowrap><b>Industry :</b></td>
        <td> BFSI </td>
      </tr>
      <tr>
        <td valign="top" nowrap><b>Sub Industry :</b></td>
        <td> Banks </td>
      </tr>
      <tr>
        <td valign="top" nowrap><b>City :</b></td>
        <td>Mumbai</td>
      </tr>
      <tr>
        <td valign="top" nowrap><b>State :</b></td>
        <td>Maharashtra</td>
      </tr>
      <tr>
        <td valign="top" nowrap><b>Pin :</b></td>
        <td>400020</td>
      </tr>
      <tr>
        <td valign="top" nowrap><b>Company Type :</b></td>
        <td>MNC</td>
      </tr>
      <tr>
        <td valign="top" nowrap><b>Total Turnover :</b></td>
        <td>10-100 Crs</td>
      </tr>
      <tr>
        <td valign="top" nowrap><b>No. of Employees :</b></td>
        <td>101-250</td>
      </tr>
      <tr>
        <td valign="top" nowrap><b>Sector :</b></td>
        <td>Private Sector</td>
      </tr>
      <tr>
        <td valign="top" nowrap>&nbsp;<br><br><a href="javascript:history.go(-1);" class="view"><img src="images/back-btn-new.jpg" border="0"></a></td>
        <td align="left" nowrap><br><br><a href='javascript: excel_divpop("excel_div","Abu Dhabi Commercial Bank (ADCB)");' class="home_heading" ><img src="images/found-incorrect-btn.jpg" border="0"></a>&nbsp;
          <a href="http://crm.fundoodata.com/"><img src="images/moveto-crm.jpg" border="0"></a>
          <br><br>
        </td>
      </tr>
      <tr>
        <td colspan="2"><hr color="#cccccc" size="1"> </td>
      </tr>
    </table>
  </td>
</tr></table>

Thanks

Replies are listed 'Best First'.
Re: Table information
by Tux (Canon) on Nov 09, 2010 at 18:03 UTC

    Don't do it with regular expressions. Use HTML::TreeBuilder and go from there.

    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new;
    {
        local $/;
        $tree->parse_content (<DATA>);
    }
    foreach my $tr ($tree->look_down (_tag => "tr")) {
        my @td = map { $_->as_text } $tr->look_down (_tag => "td");
        @td == 2 or next;
        print "[0] $td[0], [1] $td[1]\n";
    }
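
    If the goal is to pick out individual fields rather than print every row, the same loop can fill a hash keyed by label. The following is only a sketch under a few assumptions: the helper name extract_fields and the trimmed-down sample HTML are made up for illustration, and every label is assumed to end in " :" as in the posted data.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of the original thread): returns a hash
# reference mapping each label, e.g. "Address", to the text of the cell
# next to it.
sub extract_fields {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my %field;

    foreach my $tr ($tree->look_down(_tag => "tr")) {
        my @td = map { $_->as_text } $tr->look_down(_tag => "td");
        @td == 2 or next;          # keep only label/value rows

        my ($label, $value) = @td;
        for ($label, $value) {
            s/\xA0/ /g;            # &nbsp; is decoded to character 0xA0
            s/^\s+|\s+$//g;        # trim surrounding whitespace
        }
        $label =~ s/\s*:$//;       # "Address :" becomes "Address"
        next unless length $label; # skip rows without a text label
        $field{$label} = $value;
    }

    $tree->delete;                 # release the parse tree explicitly
    return \%field;
}

# Shortened sample, assumed to have the same row layout as the posted table.
my $html = <<'HTML';
<table>
  <tr><td valign="top" nowrap><b>Address :</b></td><td>Rehmat Manzil, 75, Veer Nariman Road, Churchgate </td></tr>
  <tr><td valign="top" nowrap><b>Phone No :</b></td><td>(22) 39534100</td></tr>
  <tr><td valign="top" nowrap><b>Website :</b></td><td><a href="http://www.adcbindia.com" target="_blank" class="pglink">www.adcbindia.com</a></td></tr>
  <tr><td colspan="2"><hr color="#cccccc" size="1"></td></tr>
</table>
HTML

my $field = extract_fields($html);
print "$_: $field->{$_}\n" for "Address", "Phone No", "Website";
```

    Keying on the label text instead of the row position means the lookup should keep working if rows are added or reordered, which is where the positional $data[1], $data[2] approach tends to break.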

    Enjoy, Have FUN! H.Merijn
Re: Table information
by aquarium (Curate) on Nov 09, 2010 at 21:44 UTC
    Also, to make the markup much cleaner, use CSS instead of inline HTML attributes. It will make your HTML much shorter and easier to read, and you can also generate the CSS on the fly (separately from the HTML) if you want.
    the hardest line to type correctly is: stty erase ^H