I am trying to use the Perl module HTML::TableExtract to manipulate/cleanse an HTML-tagged data file sent to me from my vendor's reporting system. The real data is long and complex, so I have simplified it and scrubbed the business details before presenting it here.

The tagged data comes to me in a file with an XLS extension, but I have copied the file and changed the new file extension to shtml.

Some rows in the HTML have cells merged with the row above them (via rowspan). Both Excel and IE render the respective data files as they are intended to look.

HTML::TableExtract seems to get confused (or I have bad code) on the row that follows a pair of rows with merged cells. Interestingly, it handles the merged rows themselves fine. But on the row after them, the values are shifted to the right and some data is lost.

my $te = HTML::TableExtract->new( headers => [(@column_names)], keep_headers => 1 );
$te->parse_file($inFilename_long);
foreach my $ts ($te->tables) {
    print "\nLine 0 ", join(', ',$ts->row(0)); print " <--- Header row";
    print "\nLine 1 ", join(', ',$ts->row(1));
    print "\nLine 2 ", join(', ',$ts->row(2));
    print "\nLine 3 ", join(', ',$ts->row(3)); print " <-- This has merged cells with the line above it and correctly pushes to the right.";
    print "\nLine 4 ", join(', ',$ts->row(4)); print " <--- Why are these column values being pushed to the right? Also data is lost.";
    print "\n";
}


Produces output:
Line 0 Column_1, Asset Tag, Washed Number, Asset Name, Cust Code, Primary IP Address <--- Header row
Line 1 R2Col1, LEN2222, RC0, LEN2222, , 1.1.1.55
Line 2 R3Col1, L3333, 090, TSMWAL, , 1.1.1.137
Line 3 , , , , R4Col4, 1.1.1.137 <-- This has merged cells with the line above it and correctly pushes to the right.
Line 4 , , , , R5Col1, TSH5555 <--- Why are these column values being pushed to the right? Also data is lost.

Here is the part of interest in the input file in text:
<body link=3Dblue vlink=3Dpurple>
<table x:str border=3D0 cellpadding=3D0 cellspacing=3D0>
<col style=3D'width:53pt'>
<col style=3D'width:58pt'>
<col style=3D'width:91pt'>
<col style=3D'width:68pt'>
<col style=3D'width:75pt'>
<col style=3D'width:75pt'>
<tr height=3D23 style=3D'height:22.5pt'>
<td colspan=3D31 style=3D'font-family:tahoma;font-size:18.0pt'>W Potential Under</td></tr>
<tr style=3D'height:10.5pt'><td></td></tr>
<tr style=3D'height:10.5pt'><td colspan=3D31 style=3D'font-family:tahoma;font-size:8.0pt;font-weight:700'>Page by:</td></tr>
<tr style=3D'height:10.5pt'><td colspan=3D31 style=3D'font-family:tahoma;font-size:8.0pt'>Tenant Data Partition: W</td></tr>
<tr style=3D'height:10.5pt'><td></td></tr>
<tr>
<td class=3Dxl34>Column_1</td>
<td class=3Dxl35>Asset Tag</td>
<td class=3Dxl35>Washed Number</td>
<td class=3Dxl35>Asset Name</td>
<td class=3Dxl35>Cust Code</td>
<td class=3Dxl35>Primary IP Address</td>
</tr>
<tr>
<td class=3Dxl32 x:num=3D"7258807">R2Col1</td>
<td class=3Dxl29>LEN2222</td>
<td class=3Dxl29>RC0</td>
<td class=3Dxl29>LEN2222</td>
<td class=3Dxl29></td>
<td class=3Dxl29>1.1.1.55</td>
</tr>
<tr>
<td rowspan=3D2 class=3Dxl32 x:num=3D"7258830">R3Col1</td>
<td rowspan=3D2 class=3Dxl29>L3333</td>
<td rowspan=3D2 class=3Dxl29>090</td>
<td rowspan=3D2 class=3Dxl29>TSMWAL</td>
<td class=3Dxl29></td>
<td class=3Dxl29>1.1.1.137</td>
</tr>
<tr>
<td class=3Dxl29>R4Col4</td>
<td class=3Dxl29>1.1.1.137</td>
</tr>
<tr>
<td class=3Dxl32 x:num=3D"7258831">R5Col1</td>
<td class=3Dxl29>TSH5555</td>
<td class=3Dxl29>4H</td>
<td class=3Dxl29>TSH5555 this data gets dropped but should not</td>
<td class=3Dxl29></td>
<td class=3Dxl29>1.1.1.69</td>
</tr>
<tr>
<td rowspan=3D2 class=3Dxl32 x:num=3D"7258844">R6Col1</td>
<td rowspan=3D2 class=3Dxl29>146666</td>
<td rowspan=3D2 class=3Dxl29>2-0</td>
<td rowspan=3D2 class=3Dxl29>TSM</td>
<td class=3Dxl29></td>
<td class=3Dxl29>1.1.1.11</td>
</tr>
<tr>
<td class=3Dxl29>R7Col4</td>
<td class=3Dxl29>1.1.1.11</td>
</tr>
</table>
</body>
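
One possibility I have not ruled out, offered as an assumption and not a confirmed diagnosis: the =3D sequences in the input look like quoted-printable encoding (=3D is an encoded "="). If they are literally present in the file, an attribute such as rowspan=3D2 would be read with the value "3D2", which Perl numifies to 3, so the merged cells may be treated as spanning three rows instead of two. The sketch below only illustrates that effect on one cell copied from the input, using decode_qp from the core MIME::QuotedPrint module; the variable names are made up for the illustration, and how HTML::TableExtract handles the value internally is assumed, not verified here.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# One cell copied from the input above, still quoted-printable encoded.
my $raw = qq{<td rowspan=3D2 class=3Dxl29>L3333</td>\n};

# Taken literally, the attribute value is the string "3D2".
my ($encoded_span) = $raw =~ /rowspan=(\w+)/;
my $numified;
{
    # Perl warns about the non-numeric tail "D2"; silence that here.
    no warnings 'numeric';
    $numified = $encoded_span + 0;
}
print "encoded rowspan '$encoded_span' numifies to $numified\n";

# Decoding first turns "=3D" back into "=", giving the intended value.
my $html = decode_qp($raw);
my ($decoded_span) = $html =~ /rowspan=(\w+)/;
print "decoded rowspan is $decoded_span\n";
```

If this turns out to apply, the whole file could presumably be read into a string, passed through decode_qp, and handed to $te->parse($html) in place of parse_file.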

In reply to HTML::TableExtract problem handling merged cells across rows by jtravillian
