zoya has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. I need to extract a table and its contents from HTML using HTML::TableExtract. When I apply TableExtract to the HTML below, it only gives me the first column (Organism, Molecular type, and so on), but it should give me "Organism: human" etc. Here is part of the HTML:
<TABLE border="0"><TR valign="top"><td >
<TABLE id="RefSNP" cellpadding="2" width="350" >
<TH class="text10" bgcolor="#ccccff" align="center" colspan="2">RefSNP</TH>
<TR ><td class="text10" bgcolor="#f1f1f1" align="right"><strong>Organism:</strong></td>
<td class="text10" bgcolor="#f1f1f1">human (<a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606"><em>Homo sapiens</em></a>)</td>
</TR><TR
This is my code:

use HTML::TableExtract;

my $te = HTML::TableExtract->new( keep_html => 1, headers => [qw(RefSNP)] );
my $file = "Reference SNP(refSNP) Cluster Report rs111.htm";
my $document = do {
    local $/ = undef;
    open my $fh, "<", $file or die "could not open $file: $!";
    <$fh>;
};
$te->parse($document);
for my $ts ( $te->tables ) {
    print "Table(", join( ',', $ts->coords ), ":\n";
    for my $row ( $ts->rows ) {
        for my $cell (@$row) {
            next unless $cell;
            $cell =~ s/<\/B>&nbsp;//i;
            print $cell . "\n";
        }
    }
}

Could anybody please help?

Replies are listed 'Best First'.
Re: Perl, HTML::TableExtract
by NetWallah (Canon) on Apr 28, 2013 at 14:55 UTC
    Disabling "slice_columns" does the trick:
    my $te = HTML::TableExtract->new( keep_html=>1, headers =>[qw(RefSNP)], slice_columns=> 0);
    From the doc:
    Enabled by default, this option controls whether vertical slices are returned from under headers that match. When disabled, all columns of the matching table are retained, regardless of whether they had a matching header above them. Disabling this also disables automap.

    New output:

    Table(1,0:
    <strong>Organism:</strong>
    human (<a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606"><em>Homo sapiens</em></a>)
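
To see the effect of slice_columns in isolation, here is a minimal sketch that runs the same extraction twice, once with the default and once with the option disabled. The inline HTML is a trimmed-down, hypothetical stand-in for the RefSNP table from the question, and extract_rows is an illustrative helper, not part of HTML::TableExtract.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical, trimmed-down stand-in for the RefSNP table in the question.
my $html = <<'HTML';
<table id="RefSNP">
<tr><th colspan="2">RefSNP</th></tr>
<tr><td><strong>Organism:</strong></td><td>human (<em>Homo sapiens</em>)</td></tr>
</table>
HTML

# Hypothetical helper: extract rows with slice_columns switched on or off.
sub extract_rows {
    my ($source, $slice) = @_;
    my $te = HTML::TableExtract->new(
        headers       => ['RefSNP'],
        slice_columns => $slice,
    );
    $te->parse($source);
    $te->eof;    # signal end of document to the parser
    return map { $_->rows } $te->tables;
}

for my $slice (1, 0) {
    print "slice_columns => $slice\n";
    for my $row (extract_rows($html, $slice)) {
        # Cells can be undef, so substitute an empty string before joining.
        print "  ", join(' | ', map { defined $_ ? $_ : '' } @$row), "\n";
    }
}
```

With slice_columns left on, only the column(s) under the matched "RefSNP" header should come back; with it off, each row should carry both the label and the value cell.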

Re: Perl, HTML::TableExtract
by hdb (Monsignor) on Apr 28, 2013 at 14:52 UTC

    Instead of choosing the table by header, try choosing it by attribs:

    # headers =>[qw(RefSNP)]);
    attribs => { id => "RefSNP" } );

    For me this worked, when applying it to the source of

    http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs111

    The output still contains a lot of HTML.

Re: Perl, HTML::TableExtract
by Khen1950fx (Canon) on Apr 28, 2013 at 15:49 UTC
    I ran into some problems when I used your HTML, so I changed it to XHTML:
    <?xml version="1.0" encoding="utf-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head/>
      <body>
        <table border="0">
          <tbody>
            <tr valign="top">
              <td>
                <table cellpadding="2" id="RefSNP" width="350">
                  <tbody>
                    <tr>
                      <th align="center" bgcolor="#ccccff" class="text10" colspan="2">RefSNP</th>
                    </tr>
                    <tr>
                      <td align="right" bgcolor="#f1f1f1" class="text10">
                        <strong>Organism:</strong>
                      </td>
                      <td bgcolor="#f1f1f1" class="text10">human (<a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&amp;id=9606"><em>Homo sapiens</em></a>)</td>
                    </tr>
                  </tbody>
                </table>
              </td>
            </tr>
          </tbody>
        </table>
      </body>
    </html>
    In addition to the advice that hdb and NetWallah gave you, I think that you want to set keep_html to 0:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TableExtract;

    my $file = '/root/Desktop/html.htm';
    my $te   = HTML::TableExtract->new(
        keep_html => 0,                       # return cell text without markup
        attribs   => { id => 'RefSNP' },      # select the table by its id attribute
    );

    # parse_file reads and parses the document in one step, so the file
    # does not need to be slurped and parsed a second time.
    $te->parse_file($file);

    foreach my $ts ( $te->tables ) {
        print 'Table(', join( ',', $ts->coords ), ":\n";
        foreach my $row ( $ts->rows ) {
            foreach my $cell (@$row) {
                next unless $cell;
                $cell =~ s[</B>&nbsp;][]i;
                print $cell . "\n";
            }
        }
    }
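
Since the original goal was output of the form "Organism: human", one possible next step is to pair the label cell with the value cell of each row instead of printing every cell on its own line. The sketch below assumes a two-column label/value layout like the RefSNP table; label_value_pairs and the inline sample HTML are hypothetical and only for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper (not part of HTML::TableExtract): turn a two-column
# label/value table into a list of [label, value] pairs.
sub label_value_pairs {
    my ($html, $table_id) = @_;

    my $te = HTML::TableExtract->new(
        keep_html => 0,                      # return text, not markup
        attribs   => { id => $table_id },    # select the table by its id
    );
    $te->parse($html);
    $te->eof;    # signal end of document to the parser

    my @pairs;
    for my $ts ($te->tables) {
        for my $row ($ts->rows) {
            my ($label, $value) = @$row;
            # Skip rows, such as the colspan header, that lack a second cell.
            next unless defined $label && defined $value;
            # Trim surrounding whitespace left over from the markup.
            for ($label, $value) { s/^\s+//; s/\s+$//; }
            next unless length $label && length $value;
            push @pairs, [$label, $value];
        }
    }
    return @pairs;
}

# Hypothetical, trimmed-down stand-in for the RefSNP table in the question.
my $sample = <<'HTML';
<table id="RefSNP">
<tr><th colspan="2">RefSNP</th></tr>
<tr><td><strong>Organism:</strong></td><td>human (<em>Homo sapiens</em>)</td></tr>
</table>
HTML

# Print each pair as "label value" on a single line.
printf "%s %s\n", @$_ for label_value_pairs($sample, 'RefSNP');
```

Returning pairs rather than printing inside the loop keeps the extraction separate from the formatting, so the same data could be loaded into a hash or written out as CSV instead.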