Table information

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I am trying to extract the required information from an HTML table using the code below.

Can anyone please suggest a better solution?
The script and the HTML are given below.

#!/usr/bin/perl

my $str="";
while(<DATA>) {
    chomp($_);
    $str .= "$_";
}

$str =~ /<td\s+valign\=\"top\"\s+width\=\"50%\">(.*)\s+<\/table>/;
$req = $1;
my @data = split(/<\/tr>/,$req);

$data[1]=~ s/<td\s+valign=\"top\"\s+nowrap><b>Address\s+:<\/b><\/td>//g;
$data[1]=~ /<td>(.*?)<\/td>/;
print "Address: $1\n";
$data[2]=~ s/<td\s+valign\=\"top\"\s+nowrap><b>Level\s+of\s+Office\s+:<\/b><\/td>//g;
$data[2]=~ /<td>(.*?)<\/td>/;
print "Level of Office: $1\n";


$data[3]=~ s/<td\s+valign\=\"top\"\s+nowrap><b>Phone\s+No\s+:<\/b><\/td>//g;
$data[3]=~ /<td>(.*?)<\/td>/;
print "Phone No: $1\n";


$data[4]=~ s/<td\s+valign\=\"top\"\s+nowrap><b>Website\s+:<\/b><\/td>//g;
$data[4]=~ /<td><a\s+href=[\"](.*)[\"]\s+target\=\"\_blank\"\s+class\=\"pglink\">(.*?)(<\/a>)<\/td>/;
print "Website: $2\n";
__DATA__
<table><tr>
    <td valign="top" width="50%">
      <table cellspacing="3" cellpadding="2" border="0" width="98%" align="center">
<!--
                <tr>
                  <td valign="top" nowrap width="30%"><b>Company Name :</b></td>
                  <td width="75%">Abu Dhabi Commercial Bank (ADCB)</td>
                </tr>-->
                <tr>
                  <td valign="top" nowrap><b>Address :</b></td>
                  <td>Rehmat Manzil, 75, Veer Nariman Road, Churchgate                                    </td>
                </tr>
                <tr>
                  <td valign="top" nowrap><b>Level of Office :</b></td>
                  <td>Head Office</td>
                </tr>
                <tr>
                  <td valign="top" nowrap><b>Phone No :</b></td>
                  <td>(22) 39534100</td>
                </tr>
                                <tr>
                  <td valign="top" nowrap><b>Website :</b></td>
                  <td><a href="http://www.adcbindia.com" target="_blank" class="pglink">www.adcbindia.com</a></td>
                </tr>
                <tr>
                  <td valign="top" nowrap><b>Industry :</b></td>
                  <td>
                  BFSI                                    </td>
                </tr>
                <tr>
                  <td valign="top" nowrap><b>Sub Industry :</b></td>
                  <td>
                  Banks                  </td>
                </tr>
                <tr>
                  <td valign="top" nowrap><b>City :</b></td>
                  <td>Mumbai</td>
                </tr>
                <tr>
                  <td valign="top" nowrap><b>State :</b></td>
                  <td>Maharashtra</td>
                </tr>
                <tr>
                  <td valign="top" nowrap><b>Pin :</b></td>
                  <td>400020</td>
                </tr>
                <tr>
                  <td valign="top" nowrap><b>Company Type :</b></td>
                  <td>MNC</td>
                </tr>
                <tr>
                  <td valign="top" nowrap><b>Total Turnover :</b></td>
                  <td>10-100 Crs</td>
                </tr>
                <tr>
                  <td valign="top" nowrap><b>No. of Employees :</b></td>
                  <td>101-250</td>
                </tr>
                                <tr>
                  <td valign="top" nowrap><b>Sector :</b></td>
                  <td>Private Sector</td>
                </tr>
                <tr>
                <td valign="top" nowrap>&nbsp;<br><br><a href="javascript:history.go(-1);" class="view"><img src="images/back-btn-new.jpg" border="0"></a></td>
                  <td align="left" nowrap><br><br><a href='javascript: excel_divpop("excel_div","Abu Dhabi Commercial Bank (ADCB)");' class="home_heading" ><img src="images/found-incorrect-btn.jpg" border="0"></a>&nbsp;                  <a href="http://crm.fundoodata.com/"><img src="images/moveto-crm.jpg" border="0"></a>
                  <br><br>
                  </td>
              </tr>
                <tr>
                  <td colspan="2"><hr color="#cccccc" size="1">
                  </td>
              </tr>
                
            </table>
        </td>
</tr></table>

Thanks


Replies are listed 'Best First'.
Re: Table information by Tux (Canon) on Nov 09, 2010 at 18:03 UTC
Don't do it with regular expressions. Use HTML::TreeBuilder and go from there.

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
{   local $/;
    $tree->parse_content (<DATA>);
    }

foreach my $tr ($tree->look_down (_tag => "tr")) {
    my @td = map { $_->as_text } $tr->look_down (_tag => "td");
    @td == 2 or next;
    print "[0] $td[0], [1] $td[1]\n";
    }

Enjoy, Have FUN! H.Merijn
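
Building on the HTML::TreeBuilder suggestion above, here is a minimal sketch that collects the label/value rows into a hash, so that individual fields such as Address or Phone No can be looked up by name. The helper name `extract_fields` is hypothetical, not something from the thread. The sketch assumes that every data row has exactly two direct `<td>` children and that the label cell ends with a colon, as in the posted HTML. It needs HTML::TreeBuilder installed from CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the thread): returns a hash ref that maps
# each label ("Address", "Phone No", ...) to the text of its value cell.
sub extract_fields {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my %field;

    for my $tr ($tree->look_down(_tag => 'tr')) {
        # Look only at direct <td> children, so the outer wrapper row
        # (one cell holding the nested table) is not mistaken for data.
        my @td = grep { ref $_ && $_->tag eq 'td' } $tr->content_list;
        next unless @td == 2;

        my ($label, $value) = map { $_->as_text } @td;
        for ($label, $value) {
            s/\xA0/ /g;        # decoded &nbsp; becomes a plain space
            s/^\s+|\s+$//g;    # trim leading and trailing whitespace
        }

        # Assumption: real labels end in ":"; rows without one (such as
        # the button row) are skipped.
        next unless $label =~ s/\s*:$//;
        $field{$label} = $value;
    }

    $tree->delete;    # free the parse tree
    return \%field;
}

# Small usage example with two rows copied from the posted HTML.
my $sample = <<'HTML';
<table><tr><td valign="top" width="50%">
  <table>
    <tr><td valign="top" nowrap><b>Level of Office :</b></td><td>Head Office</td></tr>
    <tr><td valign="top" nowrap><b>Phone No :</b></td><td>(22) 39534100</td></tr>
  </table>
</td></tr></table>
HTML

my $fields = extract_fields($sample);
print "$_: $fields->{$_}\n" for sort keys %$fields;
```

With the original script, the same helper could be fed the whole `DATA` handle by slurping it first (`local $/; my $fields = extract_fields(<DATA>);`). Keying on the label text rather than on row position means the lookup does not depend on the order of the rows.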
Re: Table information by aquarium (Curate) on Nov 09, 2010 at 21:44 UTC
Also, to make the code much cleaner, use CSS instead of inline HTML attributes. It will make your HTML much shorter and easier to read, and you can also generate the CSS on the fly (separately from the HTML) if you want.

the hardest line to type correctly is: stty erase ^H