HTML::TableExtract problem handling merged cells across rows

jtravillian has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to use perl package HTML::TableExtract to manipulate/cleanse an html-tagged data file sent to me from my vendor's reporting system. The data that actually comes to me is long and complex, but I have both business-cleansed and simplified the input data to present it here.

The tagged data comes to me in a file with an XLS extension, but I have copied the file and changed the new file extension to shtml.

Some input lines with the html have merged fields with the line above it. However, both Excel and IE render the respective data files as they are intended to look.

HTML::TableExtract seems to get confused (or I have bad code) on an input line following two lines with merged cells. Interestingly, it handles the merged lines fine. But on the line following the merged lines, the package seems to not parse correctly.

my $te = HTML::TableExtract->new( headers => [(@column_names)], keep_h
+eaders => 1  );
$te->parse_file($inFilename_long);
foreach my $ts ($te->tables) {
  print "\nLine 0 ", join(', ',$ts->row(0));  print "  <--- Header row
+";
  print "\nLine 1 ", join(', ',$ts->row(1));
  print "\nLine 2 ", join(', ',$ts->row(2));
  print "\nLine 3 ", join(', ',$ts->row(3));  print "  <-- This has me
+rged cells with the line above it and correctly pushes to the right."
+;
  print "\nLine 4 ", join(', ',$ts->row(4));  print "  <--- Why are th
+ese column values being pushed to the right? Also data is lost.";
  print "\n";
}
[download]

Produces output:


Line 0 Column_1, Asset Tag, Washed Number, Asset Name, Cust Code, Prim
+ary IP Address  <--- Header row
Line 1 R2Col1, LEN2222, RC0, LEN2222, , 1.1.1.55
Line 2 R3Col1, L3333, 090, TSMWAL, , 1.1.1.137
Line 3 , , , , R4Col4, 1.1.1.137  <-- This has merged cells with the l
+ine above it and correctly pushes to the right.
Line 4 , , , , R5Col1, TSH5555  <--- Why are these column values being
+ pushed to the right? Also data is lost.
[download]

Here is the part of interest in the input file in text:

<body link=3Dblue vlink=3Dpurple>
<table x:str border=3D0 cellpadding=3D0 cellspacing=3D0>
 <col style=3D'width:53pt'>
 <col style=3D'width:58pt'>
 <col style=3D'width:91pt'>
 <col style=3D'width:68pt'>
 <col style=3D'width:75pt'>
 <col style=3D'width:75pt'>
 <tr height=3D23 style=3D'height:22.5pt'>
<td colspan=3D31 style=3D'font-family:tahoma;font-size:18.0pt'>W Poten
+tial Under</td></tr>
<tr style=3D'height:10.5pt'><td></td></tr>
<tr style=3D'height:10.5pt'><td colspan=3D31 style=3D'font-family:taho
+ma;font-size:8.0pt;font-weight:700'>Page by:</td></tr><tr style=3D'he
+ight:10.5pt'><td colspan=3D31 style=3D'font-family:tahoma;font-size:8
+.0pt'>Tenant Data Partition: W</td></tr><tr style=3D'height:10.5pt'><
+td></td></tr>
<tr>
 <td class=3Dxl34>Column_1</td>
 <td class=3Dxl35>Asset Tag</td>
 <td class=3Dxl35>Washed Number</td>
 <td class=3Dxl35>Asset Name</td>
 <td class=3Dxl35>Cust Code</td>
 <td class=3Dxl35>Primary IP Address</td>
</tr>
<tr>
 <td class=3Dxl32 x:num=3D"7258807">R2Col1</td>
 <td class=3Dxl29>LEN2222</td>
 <td class=3Dxl29>RC0</td>
 <td class=3Dxl29>LEN2222</td>
 <td class=3Dxl29></td>
 <td class=3Dxl29>1.1.1.55</td>
</tr>
<tr>
 <td rowspan=3D2 class=3Dxl32 x:num=3D"7258830">R3Col1</td>
 <td rowspan=3D2 class=3Dxl29>L3333</td>
 <td rowspan=3D2 class=3Dxl29>090</td>
 <td rowspan=3D2 class=3Dxl29>TSMWAL</td>
 <td class=3Dxl29></td>
 <td class=3Dxl29>1.1.1.137</td>
</tr>
<tr>
 <td class=3Dxl29>R4Col4</td>
 <td class=3Dxl29>1.1.1.137</td>
</tr>
<tr>
 <td class=3Dxl32 x:num=3D"7258831">R5Col1</td>
 <td class=3Dxl29>TSH5555</td>
 <td class=3Dxl29>4H</td>
 <td class=3Dxl29>TSH5555 this data gets dropped but should not</td>
 <td class=3Dxl29></td>
 <td class=3Dxl29>1.1.1.69</td>
</tr>
<tr>
 <td rowspan=3D2 class=3Dxl32 x:num=3D"7258844">R6Col1</td>
 <td rowspan=3D2 class=3Dxl29>146666</td>
 <td rowspan=3D2 class=3Dxl29>2-0</td>
 <td rowspan=3D2 class=3Dxl29>TSM</td>
 <td class=3Dxl29></td>
 <td class=3Dxl29>1.1.1.11</td>
</tr>
<tr>
 <td class=3Dxl29>R7Col4</td>
 <td class=3Dxl29>1.1.1.11</td>
</tr>
</table>
</body>
[download]

Comment on HTML::TableExtract problem handling merged cells across rows Select or Download Code

Replies are listed 'Best First'.

Re: HTML::TableExtract problem handling merged cells across rows (OBO rowspan colspan)
by Anonymous Monk on Feb 26, 2015 at 04:44 UTC

You have an off-by-one error in your data, your expectations are wrong

This is actually row 5 column 4 not column 1 <td class=3Dxl32 x:num=3D"7258831">R5Col1</td>

thus this is column 7 so its dropped <td class=3Dxl29>TSH5555 this data gets dropped but should not</td>

So fix your data or work around it or something else entirely

Here is how I found out

Read more... (4 kB)

head0	head1	head2	head3
0/0 colspan=4 0/0 undef undef undef "0/0" "0/0" "0/0" "0/0"
1/0 rowspan=2 1/0 1/1 1/2 1/3	1/1	1/2	1/3
1/0 rowspan=2 1/0 1/1 1/2 1/3	2/1 colspan=2 rowspan=3 undef 2/1 undef 2/3 "1/0" "2/1" "2/1" "2/3"		2/3
3/0 3/0 undef undef 3/3 "3/0" "2/1" "2/1" "3/3"			3/3
4/0 4/0 undef undef 4/3 "4/0" "2/1" "2/1" "4/3"			4/3
5/0 colspan=2 5/0 undef 5/2 5/3 "5/0" "5/0" "5/2" "5/3"		5/2	5/3

TABLE(0, 0):
head0,head1,head2,head3
 0/0  colspan=4  ,,,
 1/0  rowspan=2  , 1/1 , 1/2  , 1/3  
, 2/1 colspan=2 rowspan=3  ,, 2/3 
 3/0  ,,, 3/3 
 4/0  ,,, 4/3 
 5/0 colspan=2  ,, 5/2 , 5/3 
(
  ["head0", "head1", "head2", "head3"],
  [" 0/0  colspan=4  ", undef, undef, undef],
  [" 1/0  rowspan=2  ", " 1/1 ", " 1/2  ", " 1/3  "],
  [undef, " 2/1 colspan=2 rowspan=3  ", undef, " 2/3 "],
  [" 3/0  ", undef, undef, " 3/3 "],
  [" 4/0  ", undef, undef, " 4/3 "],
  [" 5/0 colspan=2  ", undef, " 5/2 ", " 5/3 "],
)
[download]

[reply]
[d/l]
[select]

Re^2: HTML::TableExtract problem handling merged cells across rows (OBO rowspan colspan)

by jtravillian (Initiate) on Feb 27, 2015 at 15:30 UTC

Thank you for the response. You obviously spent some time looking at this.

Unfortunately, I have little or no control over the incoming data. It's from my vendor's reporting system (aka my vendor's vendor) so I can't really change it.

After reading your post, I investigated further. MS-Excel and I.E. (v8.0) both render the table data the way that the report writer intended. Everything is in the "correct and legible" column.

But Firefox (v31.4) and Chrome (v40) (and apparently perl) all render the table data differently from MS-Excel and I.E. At least to the reader, it would be considered "incorrectly and not legible".

I use "correct" and "incorrectly" as loose terms since I'm not intimately familiar with the HTML standards.

Since the report comes to me as tagged table in a text file, is there a better way I should be parsing this data other than using HTML::TableExtract?

[reply]

Re^3: HTML::TableExtract problem handling merged cells across rows (OBO rowspan colspan)

by poj (Abbot) on Feb 27, 2015 at 16:46 UTC

Not sure if this helps but I 'cleaned' up the tags with this regex to remove the 3D's

s/(class|rowspan|style|colspan)=3D/$1=/g;

#!perl
use strict;
use HTML::TableExtract;
my $infile = 'test.htm';
open IN,'<',$infile or die "$!";
open OUT,'>','clean.htm' or die "$!";
my $html;
while (<IN>){
  s/(class|rowspan|style|colspan)=3D/$1=/g;
  print OUT $_;
  $html .= $_;
}

my @col = ('Column_1','Asset Tag','Washed Number','Asset Name','Cust C
+ode','Primary IP Address');
my $te = HTML::TableExtract->new( headers=>[@col],keep_headers => 1  )
+;
$te->parse( $html );

foreach my $ts ($te->tables) {
  print "\nLine 0 ", join(', ',$ts->row(0));  
  print "\nLine 1 ", join(', ',$ts->row(1));
  print "\nLine 2 ", join(', ',$ts->row(2));
  print "\nLine 3 ", join(', ',$ts->row(3));  
  print "\nLine 4 ", join(', ',$ts->row(4));  
  print "\n";
}
[download]

[reply]
[d/l]
[select]

Re^3: HTML::TableExtract problem handling merged cells across rows (OBO rowspan colspan)

by Anonymous Monk on Feb 28, 2015 at 00:59 UTC

Try https://metacpan.org/pod/Spreadsheet::Read#xls2csv

[reply]

Re: HTML::TableExtract problem handling merged cells across rows
by Anonymous Monk on Feb 24, 2015 at 20:19 UTC

The tagged data comes to me in a file with an XLS extension, but I have copied the file and changed the new file extension to shtml.

Why? Why not try xls ?

[reply]

Re^2: HTML::TableExtract problem handling merged cells across rows

by jtravillian (Initiate) on Feb 25, 2015 at 19:11 UTC

Thanks for your response. My reason for not trying xls now is because I already have Table::Extract working in this same perl file to transform/clean 6 other reports. None of the those reports, however, contain "merged" cells within the table. I'd really like to get the HTML::TableExtract working. Not sure if its my code, but it feels like a bug within HTML::TableExtract.

[reply]

Re^3: HTML::TableExtract problem handling merged cells across rows

by Anonymous Monk on Feb 26, 2015 at 01:23 UTC

Not sure if its my code, but it feels like a bug within HTML::TableExtract.

I'm not sure about that,

The html you show has rowspan=3D2 which is not valid value for https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td#attr-rowspan

Also you forgot to give the wanted/correct output your program is supposed to produce

[reply]