Perl, HTML::TableExtract

zoya has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. I need to extract a table and its contents from HTML using HTML::TableExtract. When I apply TableExtract to the HTML below, it gives me only the first column (Organism, Molecular type, and so on). It should give me both columns, e.g. "Organism: human". Below is part of the HTML:

<TABLE  border="0"><TR  valign="top"><td >
<TABLE  id="RefSNP" cellpadding="2" width="350" >
<TH  class="text10" bgcolor="#ccccff" align="center" colspan="2">RefSNP</TH>
<TR ><td  class="text10" bgcolor="#f1f1f1" align="right"><strong>Organism:</strong></td>
<td  class="text10" bgcolor="#f1f1f1">human (<a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606"><em>Homo sapiens</em></a>)</td>
</TR><TR

use HTML::TableExtract;
my $te = HTML::TableExtract->new(
    keep_html=>1,
    headers =>[qw(RefSNP)]);
my $file = "Reference SNP(refSNP) Cluster Report  rs111.htm";
my $document = do {
    local $/ = undef;
    open my $fh, "<", $file
        or die "could not open $file: $!";
    <$fh>;
};
$te->parse( $document);

for my $ts($te->tables)
{
    print "Table(",join(',',$ts->coords),":\n";
    for my $row ($ts->rows)
    {
        for my $cell (@$row)
        {
            next unless $cell;
                    
            $cell =~ s/<\/B>&nbsp;//i;
            print $cell."\n";
        }
    }
}

This is my code. Could anybody please help?


Replies are listed 'Best First'.
Re: Perl, HTML::TableExtract by NetWallah (Canon) on Apr 28, 2013 at 14:55 UTC
Disabling "slice_columns" does the trick:

my $te = HTML::TableExtract->new(
    keep_html     => 1,
    headers       => [qw(RefSNP)],
    slice_columns => 0,
);

From the doc: "Enabled by default, this option controls whether vertical slices are returned from under headers that match. When disabled, all columns of the matching table are retained, regardless of whether they had a matching header above them. Disabling this also disables automap."

New output:

Table(1,0:
<strong>Organism:</strong>
human (<a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606"><em>Homo sapiens</em></a>)
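To see the effect of slice_columns in isolation, here is a minimal sketch that runs the same extraction twice on a small inline table. The sample HTML and the helper name extract_rows are assumptions made for illustration, not part of the original page; only the headers and slice_columns options come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample, modelled loosely on the RefSNP table in the question.
my $html = <<'HTML';
<table id="RefSNP">
  <tr><th colspan="2">RefSNP</th></tr>
  <tr><td><strong>Organism:</strong></td><td>human (<em>Homo sapiens</em>)</td></tr>
</table>
HTML

# Hypothetical helper: return each extracted row as one string,
# with cells joined by " | ". Extra options are passed through to new().
sub extract_rows {
    my ($source, %opt) = @_;
    my $te = HTML::TableExtract->new(headers => ['RefSNP'], %opt);
    $te->parse($source);
    $te->eof;    # tell the parser the document is complete
    my @lines;
    for my $ts ($te->tables) {
        for my $row ($ts->rows) {
            # Cells can be undefined, so substitute an empty string.
            push @lines, join ' | ', map { defined $_ ? $_ : '' } @$row;
        }
    }
    return @lines;
}

my @sliced = extract_rows($html);                        # default: slice_columns => 1
my @full   = extract_rows($html, slice_columns => 0);    # keep every column

print "sliced: $_\n" for @sliced;
print "full:   $_\n" for @full;
```

With the default setting only the column under the matched "RefSNP" header should come back, which is the symptom described in the question; with slice_columns => 0 the value column should appear as well.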
Re: Perl, HTML::TableExtract by hdb (Monsignor) on Apr 28, 2013 at 14:52 UTC
Instead of choosing the table by header, try choosing it by attribute:

#    headers =>[qw(RefSNP)]);
    attribs => { id => "RefSNP" } );

For me this worked when applied to the source of

http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs111

The output still contains a lot of HTML.
Re: Perl, HTML::TableExtract by Khen1950fx (Canon) on Apr 28, 2013 at 15:49 UTC
I ran into some problems when I used your HTML, so I changed it to XHTML:

<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head/>
  <body>
    <table border="0">
      <tbody>
        <tr valign="top">
          <td>
            <table cellpadding="2" id="RefSNP" width="350">
              <tbody>
                <tr>
                  <th align="center" bgcolor="#ccccff" class="text10" colspan="2">RefSNP</th>
                </tr>
                <tr>
                  <td align="right" bgcolor="#f1f1f1" class="text10">
                    <strong>Organism:</strong>
                  </td>
                  <td bgcolor="#f1f1f1" class="text10">human (<a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&amp;id=9606"><em>Homo sapiens</em></a>)</td>
                </tr>
              </tbody>
            </table>
          </td>
        </tr>
      </tbody>
    </table>
  </body>
</html>

In addition to the advice that hdb and NetWallah gave you, I think that you want to set keep_html to 0:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

my $file = '/root/Desktop/html.htm';
my $te   = HTML::TableExtract->new(
    keep_html => 0,                     # return cell text without markup
    attribs   => { id => 'RefSNP' },    # select the table by its id attribute
);

# parse_file reads and parses the file itself, so no separate
# slurp and parse() call is needed.
$te->parse_file($file);

foreach my $ts ( $te->tables ) {
    print 'Table(', join( ',', $ts->coords ), ":\n";
    foreach my $row ( $ts->rows ) {
        foreach my $cell (@$row) {
            next unless $cell;
            $cell =~ s[</B> ][]i;
            print $cell . "\n";
        }
    }
}
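Putting the suggestions together (selecting by attribs and setting keep_html to 0), the two columns can be paired into label and value. The following is only a sketch under assumptions: the inline sample HTML, the helper name refsnp_fields, and the rule "a label is a first cell ending in a colon" are all made up for illustration and may not hold for every row of the real page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample: one unrelated table, then a table with id="RefSNP".
my $html = <<'HTML';
<table><tr><td>Unrelated</td><td>table</td></tr></table>
<table id="RefSNP" cellpadding="2" width="350">
  <tr><th colspan="2">RefSNP</th></tr>
  <tr><td align="right"><strong>Organism:</strong></td>
      <td>human (<a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&amp;id=9606"><em>Homo sapiens</em></a>)</td></tr>
</table>
HTML

# Hypothetical helper: return a hash ref mapping label => value
# for the table whose id attribute is "RefSNP".
sub refsnp_fields {
    my ($source) = @_;
    my $te = HTML::TableExtract->new(
        attribs   => { id => 'RefSNP' },    # pick the table by attribute
        keep_html => 0,                     # plain text, no markup
    );
    $te->parse($source);
    $te->eof;

    my %field;
    for my $ts ($te->tables) {
        for my $row ($ts->rows) {
            # Normalise cells: undefined becomes '', surrounding whitespace is trimmed.
            my @cells = map {
                my $c = defined $_ ? $_ : '';
                $c =~ s/^\s+|\s+$//g;
                $c;
            } @$row;
            # Assumed convention: a label cell ends with a colon.
            next unless @cells >= 2 && $cells[0] =~ /:$/ && length $cells[1];
            ( my $label = $cells[0] ) =~ s/:$//;
            $field{$label} = $cells[1];
        }
    }
    return \%field;
}

my $field = refsnp_fields($html);
print "$_ => $field->{$_}\n" for sort keys %$field;
```

Because the table is chosen by its id rather than by header text, no column slicing is involved, and the "RefSNP" title row is skipped simply because its first cell does not end in a colon.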