Re: Reverse engineering HTML

When it comes to extracting data from HTML, I've ditched parsing it in Perl in favour of HTML-tidy and XSL stylesheets.

HTML-tidy is a tool that tries to convert ugly HTML into well-formed XHTML, and it does a good job of it. You might want to preprocess your HTML with it, as it removes a lot of the ugly special cases that make interpreting HTML such a pain.

XSL stylesheets (I use Saxon as the interpreter) provide an easy way to transform XML (and XHTML is a special case of XML) into other text formats, using a pattern-matching method that resembles regular expressions (although the syntax is not really that of regular expressions).

If you're not afraid to include the two system calls (HTML-tidy promises a Perl API, and there are XSL APIs for Perl as well), this might make your work a little bit easier.
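
For what it's worth, the two system calls could be chained roughly like this. It is only a sketch: the jar path and the file names are made up, the Saxon options assume a Saxon 9 or later command line (older releases take positional arguments instead), and the sub names are mine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical location of the Saxon jar; adjust for your installation.
my $SAXON_JAR = '/opt/saxon/saxon-he.jar';

# Build the Tidy command: quiet, XHTML output, numeric entities, write to $xhtml.
sub tidy_command {
    my ($html, $xhtml) = @_;
    return ('tidy', '-q', '-asxhtml', '-numeric', '-o', $xhtml, $html);
}

# Build the Saxon command (assumes Saxon 9+ style -s:/-xsl:/-o: options).
sub saxon_command {
    my ($xhtml, $stylesheet, $out) = @_;
    return ('java', '-jar', $SAXON_JAR, "-s:$xhtml", "-xsl:$stylesheet", "-o:$out");
}

# Run both steps for one file; returns true if both succeeded.
sub extract {
    my ($html, $stylesheet, $out) = @_;
    my $xhtml = "$html.xhtml";

    system(tidy_command($html, $xhtml));
    return 0 if $? == -1 || ($? & 127);   # could not start, or killed by a signal
    return 0 if ($? >> 8) > 1;            # Tidy exit code 2 means errors

    return system(saxon_command($xhtml, $stylesheet, $out)) == 0;
}

# Example call with hypothetical file names:
# extract('page.html', 'extract.xsl', 'page.txt') or warn "extraction failed\n";
```

The list form of system avoids the shell, so file names with spaces or odd characters do no harm. Tidy's exit code 1 only signals warnings, which is the normal case for ugly HTML, so the sketch treats it as success.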


Replies are listed 'Best First'.
Re: Re: Reverse engineering HTML by THRAK (Monk) on Jun 14, 2001 at 21:06 UTC
I have to give a big ++ to Corion for this advice. If you have malformed HTML, running it through Tidy will definitely make it far more usable. Although there is currently no Perl implementation of it (WHAH!), it is very easy to incorporate via a Perl system call. If you have a lot of pages to process, you can build a Perl loop and process them one after another. If this is part of an inline process, you can run each file through Tidy before you parse it or do whatever else with it. I'm currently integrating such an inline Tidy & Perl HTML::Parser step into an existing PHP process. If you have any questions, feel free to contact me.

-THRAK
www.polarlava.com
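
A loop of the kind described above might look something like this. Treat it as a rough sketch rather than the actual process: the directory names, the output naming scheme and the sub names are all invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(basename);

# Map an input HTML path to an output path in $out_dir (hypothetical naming scheme).
sub tidied_name {
    my ($path, $out_dir) = @_;
    my $name = basename($path);
    $name =~ s/\.html?$//i;               # drop a trailing .htm or .html
    return "$out_dir/$name.xhtml";
}

# Run Tidy over every file; returns the outputs that were produced.
sub tidy_all {
    my ($out_dir, @files) = @_;
    my @done;
    for my $file (@files) {
        my $out = tidied_name($file, $out_dir);
        system('tidy', '-q', '-asxhtml', '-o', $out, $file);
        # Exit code 0 = clean, 1 = warnings only; anything else is treated as a failure.
        if ($? == -1 || ($? & 127) || ($? >> 8) > 1) {
            warn "tidy failed on $file\n";
            next;
        }
        push @done, $out;
    }
    return @done;
}

# Example with hypothetical directories:
# my @clean = tidy_all('tidied', glob('pages/*.html'));
# Each entry of @clean can then be handed to HTML::Parser or similar.
```

Files that Tidy gives up on are skipped with a warning rather than stopping the whole run, which seems the friendlier choice when there are a lot of pages.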
