jtravillian has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to use perl package HTML::TableExtract to manipulate/cleanse an html-tagged data file sent to me from my vendor's reporting system. The data that actually comes to me is long and complex, but I have both business-cleansed and simplified the input data to present it here.

The tagged data comes to me in a file with an XLS extension, but I have copied the file and changed the new file extension to shtml.

Some input lines with the html have merged fields with the line above it. However, both Excel and IE render the respective data files as they are intended to look.

HTML::TableExtract seems to get confused (or I have bad code) on an input line following two lines with merged cells. Interestingly, it handles the merged lines fine. But on the line following the merged lines, the package seems to not parse correctly.

my $te = HTML::TableExtract->new( headers => [(@column_names)], keep_h +eaders => 1 ); $te->parse_file($inFilename_long); foreach my $ts ($te->tables) { print "\nLine 0 ", join(', ',$ts->row(0)); print " <--- Header row +"; print "\nLine 1 ", join(', ',$ts->row(1)); print "\nLine 2 ", join(', ',$ts->row(2)); print "\nLine 3 ", join(', ',$ts->row(3)); print " <-- This has me +rged cells with the line above it and correctly pushes to the right." +; print "\nLine 4 ", join(', ',$ts->row(4)); print " <--- Why are th +ese column values being pushed to the right? Also data is lost."; print "\n"; }


Produces output:
Line 0 Column_1, Asset Tag, Washed Number, Asset Name, Cust Code, Prim +ary IP Address <--- Header row Line 1 R2Col1, LEN2222, RC0, LEN2222, , 1.1.1.55 Line 2 R3Col1, L3333, 090, TSMWAL, , 1.1.1.137 Line 3 , , , , R4Col4, 1.1.1.137 <-- This has merged cells with the l +ine above it and correctly pushes to the right. Line 4 , , , , R5Col1, TSH5555 <--- Why are these column values being + pushed to the right? Also data is lost.
Here is the part of interest in the input file in text:
<body link=3Dblue vlink=3Dpurple> <table x:str border=3D0 cellpadding=3D0 cellspacing=3D0> <col style=3D'width:53pt'> <col style=3D'width:58pt'> <col style=3D'width:91pt'> <col style=3D'width:68pt'> <col style=3D'width:75pt'> <col style=3D'width:75pt'> <tr height=3D23 style=3D'height:22.5pt'> <td colspan=3D31 style=3D'font-family:tahoma;font-size:18.0pt'>W Poten +tial Under</td></tr> <tr style=3D'height:10.5pt'><td></td></tr> <tr style=3D'height:10.5pt'><td colspan=3D31 style=3D'font-family:taho +ma;font-size:8.0pt;font-weight:700'>Page by:</td></tr><tr style=3D'he +ight:10.5pt'><td colspan=3D31 style=3D'font-family:tahoma;font-size:8 +.0pt'>Tenant Data Partition: W</td></tr><tr style=3D'height:10.5pt'>< +td></td></tr> <tr> <td class=3Dxl34>Column_1</td> <td class=3Dxl35>Asset Tag</td> <td class=3Dxl35>Washed Number</td> <td class=3Dxl35>Asset Name</td> <td class=3Dxl35>Cust Code</td> <td class=3Dxl35>Primary IP Address</td> </tr> <tr> <td class=3Dxl32 x:num=3D"7258807">R2Col1</td> <td class=3Dxl29>LEN2222</td> <td class=3Dxl29>RC0</td> <td class=3Dxl29>LEN2222</td> <td class=3Dxl29></td> <td class=3Dxl29>1.1.1.55</td> </tr> <tr> <td rowspan=3D2 class=3Dxl32 x:num=3D"7258830">R3Col1</td> <td rowspan=3D2 class=3Dxl29>L3333</td> <td rowspan=3D2 class=3Dxl29>090</td> <td rowspan=3D2 class=3Dxl29>TSMWAL</td> <td class=3Dxl29></td> <td class=3Dxl29>1.1.1.137</td> </tr> <tr> <td class=3Dxl29>R4Col4</td> <td class=3Dxl29>1.1.1.137</td> </tr> <tr> <td class=3Dxl32 x:num=3D"7258831">R5Col1</td> <td class=3Dxl29>TSH5555</td> <td class=3Dxl29>4H</td> <td class=3Dxl29>TSH5555 this data gets dropped but should not</td> <td class=3Dxl29></td> <td class=3Dxl29>1.1.1.69</td> </tr> <tr> <td rowspan=3D2 class=3Dxl32 x:num=3D"7258844">R6Col1</td> <td rowspan=3D2 class=3Dxl29>146666</td> <td rowspan=3D2 class=3Dxl29>2-0</td> <td rowspan=3D2 class=3Dxl29>TSM</td> <td class=3Dxl29></td> <td class=3Dxl29>1.1.1.11</td> </tr> <tr> <td class=3Dxl29>R7Col4</td> <td class=3Dxl29>1.1.1.11</td> </tr> </table> </body>

Replies are listed 'Best First'.
Re: HTML::TableExtract problem handling merged cells across rows (OBO rowspan colspan)
by Anonymous Monk on Feb 26, 2015 at 04:44 UTC

    You have an off-by-one error in your data, your expectations are wrong

    This is actually row 5 column 4 not column 1  <td class=3Dxl32 x:num=3D"7258831">R5Col1</td>

    thus this is column 7 so its dropped <td class=3Dxl29>TSH5555 this data gets dropped but should not</td>

    So fix your data or work around it or something else entirely

    Here is how I found out

    head0 head1 head2 head3
    0/0 colspan=4
    0/0 undef undef undef
    "0/0" "0/0" "0/0" "0/0"
    1/0 rowspan=2
    1/0 1/1 1/2 1/3
    1/1 1/2 1/3
    2/1 colspan=2 rowspan=3
    undef 2/1 undef 2/3
    "1/0" "2/1" "2/1" "2/3"
    2/3
    3/0
    3/0 undef undef 3/3
    "3/0" "2/1" "2/1" "3/3"
    3/3
    4/0
    4/0 undef undef 4/3
    "4/0" "2/1" "2/1" "4/3"
    4/3
    5/0 colspan=2
    5/0 undef 5/2 5/3
    "5/0" "5/0" "5/2" "5/3"
    5/2 5/3
    TABLE(0, 0): head0,head1,head2,head3 0/0 colspan=4 ,,, 1/0 rowspan=2 , 1/1 , 1/2 , 1/3 , 2/1 colspan=2 rowspan=3 ,, 2/3 3/0 ,,, 3/3 4/0 ,,, 4/3 5/0 colspan=2 ,, 5/2 , 5/3 ( ["head0", "head1", "head2", "head3"], [" 0/0 colspan=4 ", undef, undef, undef], [" 1/0 rowspan=2 ", " 1/1 ", " 1/2 ", " 1/3 "], [undef, " 2/1 colspan=2 rowspan=3 ", undef, " 2/3 "], [" 3/0 ", undef, undef, " 3/3 "], [" 4/0 ", undef, undef, " 4/3 "], [" 5/0 colspan=2 ", undef, " 5/2 ", " 5/3 "], )

      Thank you for the response. You obviously spent some time looking at this.

      Unfortunately, I have little or no control over the incoming data. It's from my vendor's reporting system (aka my vendor's vendor) so I can't really change it.

      After reading your post, I investigated further. MS-Excel and I.E. (v8.0) both render the table data the way that the report writer intended. Everything is in the "correct and legible" column.

      But Firefox (v31.4) and Chrome (v40) (and apparently perl) all render the table data differently from MS-Excel and I.E. At least to the reader, it would be considered "incorrectly and not legible".

      I use "correct" and "incorrectly" as loose terms since I'm not intimately familiar with the HTML standards.

      Since the report comes to me as tagged table in a text file, is there a better way I should be parsing this data other than using HTML::TableExtract?

        Not sure if this helps but I 'cleaned' up the tags with this regex to remove the 3D's

        s/(class|rowspan|style|colspan)=3D/$1=/g;

        #!perl use strict; use HTML::TableExtract; my $infile = 'test.htm'; open IN,'<',$infile or die "$!"; open OUT,'>','clean.htm' or die "$!"; my $html; while (<IN>){ s/(class|rowspan|style|colspan)=3D/$1=/g; print OUT $_; $html .= $_; } my @col = ('Column_1','Asset Tag','Washed Number','Asset Name','Cust C +ode','Primary IP Address'); my $te = HTML::TableExtract->new( headers=>[@col],keep_headers => 1 ) +; $te->parse( $html ); foreach my $ts ($te->tables) { print "\nLine 0 ", join(', ',$ts->row(0)); print "\nLine 1 ", join(', ',$ts->row(1)); print "\nLine 2 ", join(', ',$ts->row(2)); print "\nLine 3 ", join(', ',$ts->row(3)); print "\nLine 4 ", join(', ',$ts->row(4)); print "\n"; }
        poj
Re: HTML::TableExtract problem handling merged cells across rows
by Anonymous Monk on Feb 24, 2015 at 20:19 UTC

    The tagged data comes to me in a file with an XLS extension, but I have copied the file and changed the new file extension to shtml.

    Why? Why not try xls ?

      Thanks for your response. My reason for not trying xls now is because I already have Table::Extract working in this same perl file to transform/clean 6 other reports. None of the those reports, however, contain "merged" cells within the table. I'd really like to get the HTML::TableExtract working. Not sure if its my code, but it feels like a bug within HTML::TableExtract.