I have been working on a project for the last few months, and one part of it parses HTML pages and creates online forms that allow editing the content of particular parts of those pages. My problem now seems to be that browsers are far more forgiving than my tool of choice, HTML::TreeBuilder, because HTML like this:
<html>
<head>
</head>
<body>
<table>
<tr>
<center>
<td> </td>
</center>
</tr>
</table>
</body>
<html>
Turns into:
<html><head> </head><body>
<table>
<tr>
<td><center> </center></td>
<td> </td>
</tr>
</table>
</body>
</html>
The stray <center> tag gets wrapped in an extra <td> tag.
While the browser is able to handle the original BAD html, the resulting "corrected" html doesn't display correctly. I had been running the pages through tidy first, but tidy handles many of these cases poorly as well, and the result was even worse formatting.
This is just one example of the type of HTML I need to deal with; the sky's the limit for what other ill-formed documents await me. The behavior also differs between versions of HTML::TreeBuilder. I was running an older copy and upgraded today to the latest version to make sure this wasn't a bug that had already been fixed. The results vary between the older version and the latest, but neither preserves the bad html.
Here is a sample script:
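use strict;
use warnings;
use HTML::TreeBuilder;

# Read the sample document from the DATA section below.
my $html;
while (<DATA>) {
$html .= $_;
}

my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
$tree->eof();    # signal end of input so the parser finishes the tree
print $tree->as_HTML();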
__DATA__
<html>
<head>
</head>
<body>
<table>
<tr>
<center>
<td> </td>
</center>
</tr>
</table>
</body>
<html>