Reverse engineering HTML

larsen has asked for the wisdom of the Perl Monks concerning the following question:

I suppose you have faced, at least once, the problem of reverse engineering extremely poorly written HTML. In this case the issue is not simply parsing the HTML, but guessing the structure (if there is a structure) behind what was written by some absent-minded HTML coder (artificial, human, or a messy combination of both).

Here is an example of what I'm coping with:

</tr>
</table><html>
<body bgcolor="#FFFFFF">
<table  border="0" cellspacing="0" cellpadding="0">   <link rel="stylesheet" href="/stile.css" type="text/css">

        <tr>
       
        <td>
        <span class="span">
        <font color="#FF0000" size="1" face="Verdana, Arial">29/05/2001 15:30</font>&nbsp;
<font size="1" face="Verdana, Arial">(ACOI - associazione Chirurghi ospedalieri italiani)
        <br>
        <a href="http://www.immedia.it/published/20010529/2001052914683.shtml" target="ImmediaPress"><font size="1" face="Verdana, Arial">
        <b>IN DUEMILA DA TUTTO IL MONDO AL CONGRESSO 

DELL'ASSOCIAZIONE DEI CHIRURGHI OSPEDALIERI ITALIANI </b></font>
        </a>
        <br>
        <!-- <font face="verdana, arial, helvetica" size="1">
        (IMMEDIAPRESS) Modena e' stata per quattro giorni una citta' internazionale grazie...
        </font> -->
        
        </font>
        </td>
        </tr>
    
        <tr>
        <td>
        <img src="http://www.immedia.it/images/line_home.gif" width=308 height=1 border=0 alt=""><br>
        </td>
        </tr>
 <tr><td height="6"></td></tr>

</table>
</body>
</html><html>
<body bgcolor="#FFFFFF">

...and so on.

Right now I'm using Adobe GoLive to dig through the HTML code, since it provides the best tree view I'm aware of. Are there common techniques or general principles for dealing with such problems? Thank you.

2001-06-16 Edit by Corion : Fixed link

Replies are listed 'Best First'.
Re: Reverse engineering HTML by Corion (Patriarch) on Jun 14, 2001 at 17:52 UTC
I've ditched Perl for parsing HTML in favour of HTML-tidy and XSL stylesheets when it comes to extracting data from HTML. HTML-tidy is a tool that tries to convert ugly HTML into well-formed XHTML, and it does a good job of it. You might want to preprocess your HTML with it, as it removes a lot of the ugly special cases that make interpreting HTML such a pain. XSL stylesheets (I use Saxon as the interpreter) provide an easy way to transform XML (and XHTML is a special case of XML) into other ASCII-formatted files, using pattern matching that works a bit like regular expressions (although the syntax is not really that of regular expressions). If you're not afraid to include the two system calls (HTML-tidy promises a Perl API, and there are XSL APIs for Perl as well), this might make your work a little bit easier.
Re: Re: Reverse engineering HTML by THRAK (Monk) on Jun 14, 2001 at 21:06 UTC
I have to give a big ++ to Corion for this advice. If you have malformed HTML, running it through Tidy will definitely make it far more usable. Although there is currently no Perl implementation of it (WHAH!), it is very easy to incorporate via a Perl system call. If you have a lot of pages to process, you can build a Perl loop and process them one after another. If this is part of an inline process, you can run each file through Tidy before you parse it or do whatever else with it. I'm currently implementing such an inline Tidy & Perl HTML::Parser step in an existing PHP process. If you have any questions, feel free to contact me. -THRAK www.polarlava.com
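The batch approach described in the two replies above can be sketched roughly as follows. This is a minimal illustration in Python rather than Perl, not anyone's actual setup: the function names and directory layout are hypothetical, and the flags (`-quiet`, `-asxhtml`, `-o`) are assumed from current HTML Tidy releases, so check `tidy -help` for the version you have installed.

```python
import shutil
import subprocess
from pathlib import Path


def tidy_command(src: Path, dst: Path) -> list[str]:
    """Build a Tidy command line that converts src to XHTML and writes it to dst."""
    # Assumed flags: -quiet (less chatter), -asxhtml (emit XHTML), -o (output file).
    return ["tidy", "-quiet", "-asxhtml", "-o", str(dst), str(src)]


def tidy_all(src_dir: Path, dst_dir: Path) -> list[Path]:
    """Run Tidy over every *.html file in src_dir and return the files it wrote."""
    if shutil.which("tidy") is None:
        raise RuntimeError("the tidy executable was not found on PATH")
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.glob("*.html")):
        dst = dst_dir / src.name
        # Tidy exits with 1 when it only has warnings and 2 when it hit errors,
        # so a non-zero status is inspected here instead of raising on it.
        result = subprocess.run(tidy_command(src, dst))
        if result.returncode < 2 and dst.exists():
            written.append(dst)
    return written
```

Calling `tidy_all(Path("raw"), Path("clean"))` would leave the cleaned copies in `clean/`, ready for whatever parser or stylesheet comes next; files that Tidy rejects outright are simply left out of the returned list.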
Re: Reverse engineering HTML by Masem (Monsignor) on Jun 14, 2001 at 17:34 UTC
What an ugly mess... I pity you... :-) I'm curious as to why there are multiple `<HTML>` tags in the same document. Assuming that's not an artifact that you created, I would split this huge document up into several parts using these tags as 'delimiters', and handle each piece separately (since multiple `<HTML>` tags have no value). Within those individual pieces, it might be easier to see structure. In this case, the person used a one-column table, probably to get some effect, but it's otherwise useless from what I can tell. Programmatically, if all you care about is extracting the information from the page, it might just be easier to use lynx to get the text versions, possibly intelligently adding `<P>, <A>, and <UL>` tags and ignoring the rest of the formatting, to at least give you a starting point where you have not lost any of the content and can begin anew with the HTML design. Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
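The splitting idea above could look something like this. It is only a sketch, written in Python for illustration, and `split_documents` is a hypothetical name. It assumes every opening `<html>` tag really does start a new fragment, and it does not try to detect a tag that happens to sit inside a comment.

```python
import re

# Matches the start of an opening <html> tag in any letter case.
HTML_OPEN = re.compile(r"<html\b", re.IGNORECASE)


def split_documents(text):
    """Cut text into pieces, starting a new piece at every opening <html> tag."""
    starts = [m.start() for m in HTML_OPEN.finditer(text)]
    bounds = [0] + starts + [len(text)]
    pieces = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    # Whatever precedes the first <html> (stray closing tags, for instance)
    # is kept as its own piece; pieces that are only whitespace are dropped.
    return [p for p in pieces if p.strip()]
```

On a file like the sample in the question, the first piece would be the orphaned `</tr></table>` left over from a truncated fragment, and each later piece would be one `<html>...</html>` block that can be examined on its own.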
Re: Reverse engineering HTML by Vynce (Friar) on Jun 14, 2001 at 17:40 UTC
I seriously advise you to throw this so-called HTML away, shoot the author, and rewrite it (from scratch or using a perl-script). Step 3 is optional; steps 1 and 2 are not. I have seen many projects of exactly this sort. They never turn out to be worthwhile; the HTML is always worse than anybody has any reason to expect. Every pattern you think you see will be broken, except this one: they will constantly find new and creative ways to abuse the HTML. They will forget to close tables. They will left justify and right justify the same paragraph. They will nest comments. (nesting comments, (or fails to work) --> like this). They will even use the blink tag. Trust me. Let the bad HTML flow out of your database and out of your life. You don't want it. It is faster to retype it properly than to "fix" that mess. My mother was given a free car once. It's the most expensive car she's ever owned. She buys them now. It's cheaper.
Re: Reverse engineering HTML by schumi (Hermit) on Jun 14, 2001 at 19:19 UTC
Your example looks to me like it has been created by some sort of WYSIWYG tool, a category to which Adobe GoLive also belongs. I have to cope with a site which has been done entirely with GoLive, and the code behind it is simply abominable. If you find a tool to tidy the code, I'd be grateful to know more about it. I still find HTML editors such as HomeSite the most useful. It even offers a tidy tool which is actually not too bad. When you write your own code you can keep it simple, to the point and correct, assuming you know your HTML. I do recognise, though, that you can't design a big project from scratch by writing code; you'd probably get to be a hundred and still not be finished. But when code is not just way too complex, but just plain wrong, rewriting it is probably the best idea, right after shooting the author, as Vynce suggests. -- cs
Re: Re: Reverse engineering HTML by BrotherAde (Pilgrim) on Jun 20, 2001 at 10:39 UTC
Allow me to differ - I think it is not only possible, but indeed desirable to design any project by writing code by hand. Granted - it might take you longer, but the maintainability is just so much better, which is what really matters in big projects...
Re: Re: Re: Reverse engineering HTML by schumi (Hermit) on Jun 20, 2001 at 11:29 UTC
True, maintainability is what really matters in big projects - to us. But (too) often, it's the people who don't have to maintain it who decide on deadlines, and although I know that deadlines are at their best when they whoosh past you, IRL it's not always that easy. Deadlines are there to be kept, or else you're out. I don't like this, either, and Ade, you know what site I'm talking about, and that I hate it as much as you did. But I just had to program a whole new site of 16 pages (which is only the German version; French and Italian follow) in just under a week - and of course I got the last content one day before the site had to be up and running. And I did it all by writing the code by hand. But while fitting tables so that the stuff looks right, I sometimes almost gave in to the temptation to re-install Dreamweaver and use it for that. (I didn't, and it's all done now - phew!) There's one more thing to say in favour of writing your own code: applications like Dreamweaver or Adobe GoLive create tons of redundant code, and that makes file sizes go up. By writing your own code and keeping it clear, simple and to the point, you can keep file size down - and hence improve download time and attractivity (is that the word?) of your site. --cs
Re: Reverse engineering HTML by one4k4 (Hermit) on Jun 14, 2001 at 20:01 UTC
One thing I personally do when I get garbage code like that is to open it in my favorite text editor... and simply reformat every tag to be on a new line. I do it in the editor vs. doing it in Perl, because this way I get to look at the code line by line, and try to figure out where it's fsck'd. Just my $0.02 _14k4 - webmaster@poorheart.com (www.poorheart.com)
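For files too large to reformat by hand, the same one-tag-per-line layout can be approximated with a small script before reading through it in the editor. The sketch below is in Python for illustration, and `one_tag_per_line` is a hypothetical name. It is deliberately naive: it treats anything between `<` and the next `>` as a tag, so comments and attribute values containing `>` will be split oddly, and it collapses whitespace, which would alter `<pre>` content.

```python
import re

# Naive tag pattern: everything from "<" up to the next ">".
TAG = re.compile(r"<[^>]+>")


def one_tag_per_line(html):
    """Return html with each tag, and each run of text, on its own line."""
    # Collapse whitespace runs first so hard-wrapped text is rejoined.
    flat = re.sub(r"\s+", " ", html)
    # Surround every tag with line breaks, then drop the empty lines this creates.
    spaced = TAG.sub(lambda m: "\n" + m.group(0) + "\n", flat)
    lines = (line.strip() for line in spaced.split("\n"))
    return "\n".join(line for line in lines if line)
```

Applied to `<tr><td height="6"></td></tr>` from the sample in the question, this yields four lines, one per tag, which makes unbalanced opens and closes much easier to spot by eye.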