jtravillian has asked for the wisdom of the Perl Monks concerning the following question:
I am trying to use perl package HTML::TableExtract to manipulate/cleanse an html-tagged data file sent to me from my vendor's reporting system. The data that actually comes to me is long and complex, but I have both business-cleansed and simplified the input data to present it here.
The tagged data comes to me in a file with an XLS extension, but I have copied the file and changed the new file extension to shtml.
Some input lines with the html have merged fields with the line above it. However, both Excel and IE render the respective data files as they are intended to look.
HTML::TableExtract seems to get confused (or I have bad code) on an input line following two lines with merged cells. Interestingly, it handles the merged lines fine. But on the line following the merged lines, the package seems to not parse correctly.
my $te = HTML::TableExtract->new( headers => [(@column_names)], keep_h +eaders => 1 ); $te->parse_file($inFilename_long); foreach my $ts ($te->tables) { print "\nLine 0 ", join(', ',$ts->row(0)); print " <--- Header row +"; print "\nLine 1 ", join(', ',$ts->row(1)); print "\nLine 2 ", join(', ',$ts->row(2)); print "\nLine 3 ", join(', ',$ts->row(3)); print " <-- This has me +rged cells with the line above it and correctly pushes to the right." +; print "\nLine 4 ", join(', ',$ts->row(4)); print " <--- Why are th +ese column values being pushed to the right? Also data is lost."; print "\n"; }
Here is the part of interest in the input file in text:Line 0 Column_1, Asset Tag, Washed Number, Asset Name, Cust Code, Prim +ary IP Address <--- Header row Line 1 R2Col1, LEN2222, RC0, LEN2222, , 1.1.1.55 Line 2 R3Col1, L3333, 090, TSMWAL, , 1.1.1.137 Line 3 , , , , R4Col4, 1.1.1.137 <-- This has merged cells with the l +ine above it and correctly pushes to the right. Line 4 , , , , R5Col1, TSH5555 <--- Why are these column values being + pushed to the right? Also data is lost.
<body link=3Dblue vlink=3Dpurple> <table x:str border=3D0 cellpadding=3D0 cellspacing=3D0> <col style=3D'width:53pt'> <col style=3D'width:58pt'> <col style=3D'width:91pt'> <col style=3D'width:68pt'> <col style=3D'width:75pt'> <col style=3D'width:75pt'> <tr height=3D23 style=3D'height:22.5pt'> <td colspan=3D31 style=3D'font-family:tahoma;font-size:18.0pt'>W Poten +tial Under</td></tr> <tr style=3D'height:10.5pt'><td></td></tr> <tr style=3D'height:10.5pt'><td colspan=3D31 style=3D'font-family:taho +ma;font-size:8.0pt;font-weight:700'>Page by:</td></tr><tr style=3D'he +ight:10.5pt'><td colspan=3D31 style=3D'font-family:tahoma;font-size:8 +.0pt'>Tenant Data Partition: W</td></tr><tr style=3D'height:10.5pt'>< +td></td></tr> <tr> <td class=3Dxl34>Column_1</td> <td class=3Dxl35>Asset Tag</td> <td class=3Dxl35>Washed Number</td> <td class=3Dxl35>Asset Name</td> <td class=3Dxl35>Cust Code</td> <td class=3Dxl35>Primary IP Address</td> </tr> <tr> <td class=3Dxl32 x:num=3D"7258807">R2Col1</td> <td class=3Dxl29>LEN2222</td> <td class=3Dxl29>RC0</td> <td class=3Dxl29>LEN2222</td> <td class=3Dxl29></td> <td class=3Dxl29>1.1.1.55</td> </tr> <tr> <td rowspan=3D2 class=3Dxl32 x:num=3D"7258830">R3Col1</td> <td rowspan=3D2 class=3Dxl29>L3333</td> <td rowspan=3D2 class=3Dxl29>090</td> <td rowspan=3D2 class=3Dxl29>TSMWAL</td> <td class=3Dxl29></td> <td class=3Dxl29>1.1.1.137</td> </tr> <tr> <td class=3Dxl29>R4Col4</td> <td class=3Dxl29>1.1.1.137</td> </tr> <tr> <td class=3Dxl32 x:num=3D"7258831">R5Col1</td> <td class=3Dxl29>TSH5555</td> <td class=3Dxl29>4H</td> <td class=3Dxl29>TSH5555 this data gets dropped but should not</td> <td class=3Dxl29></td> <td class=3Dxl29>1.1.1.69</td> </tr> <tr> <td rowspan=3D2 class=3Dxl32 x:num=3D"7258844">R6Col1</td> <td rowspan=3D2 class=3Dxl29>146666</td> <td rowspan=3D2 class=3Dxl29>2-0</td> <td rowspan=3D2 class=3Dxl29>TSM</td> <td class=3Dxl29></td> <td class=3Dxl29>1.1.1.11</td> </tr> <tr> <td class=3Dxl29>R7Col4</td> <td class=3Dxl29>1.1.1.11</td> </tr> </table> </body>
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: HTML::TableExtract problem handling merged cells across rows (OBO rowspan colspan)
by Anonymous Monk on Feb 26, 2015 at 04:44 UTC | |
by jtravillian (Initiate) on Feb 27, 2015 at 15:30 UTC | |
by poj (Abbot) on Feb 27, 2015 at 16:46 UTC | |
by Anonymous Monk on Feb 28, 2015 at 00:59 UTC | |
|
Re: HTML::TableExtract problem handling merged cells across rows
by Anonymous Monk on Feb 24, 2015 at 20:19 UTC | |
by jtravillian (Initiate) on Feb 25, 2015 at 19:11 UTC | |
by Anonymous Monk on Feb 26, 2015 at 01:23 UTC |