in reply to Reverse engineering HTML
When it comes to extracting data from HTML, I've ditched parsing it in Perl in favour of HTML-tidy and XSL stylesheets.
HTML-tidy is a tool that tries to convert ugly HTML into well-formed XHTML, and it does a good job of it. You might want to preprocess your HTML with it, as it removes a lot of the ugly special cases that make interpreting HTML such a pain.
XSL stylesheets (I use Saxon as the interpreter) provide an easy way to transform XML (and XHTML is a special case of XML) into other text-based formats, using a pattern-matching method reminiscent of regular expressions (although the syntax is not really the syntax of regular expressions).
If you're not afraid of adding two system calls to your program, this might make your work a little bit easier. (HTML-tidy promises a Perl API, and there are XSL APIs for Perl as well, so the system calls may not be needed in the long run.)
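To make the two system calls concrete, here is a minimal sketch in Python rather than Perl, under several assumptions: `tidy` and `java` are on the PATH, the Saxon jar path is supplied by the caller, and Saxon is a current release (the `-s:`/`-xsl:`/`-o:` option syntax shown here differs from older Saxon versions). The stylesheet, the file names, and the helper names `tidy_cmd`, `saxon_cmd`, and `extract` are hypothetical and only illustrate the idea; the stylesheet simply prints the `href` of every link.

```python
import subprocess

# Hypothetical stylesheet: print the href of every link, one per line.
# tidy's XHTML output puts elements in the XHTML namespace, so the
# match patterns use a prefix (h:) bound to that namespace.
LINKS_XSL = """<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//h:a[@href]">
      <xsl:value-of select="@href"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
"""


def tidy_cmd(html_path, xhtml_path):
    """Build the first system call: messy HTML in, XHTML out."""
    # -asxhtml: emit XHTML; -numeric: numeric instead of named entities;
    # -quiet: suppress the banner; -o: write to a file instead of stdout.
    return ["tidy", "-asxhtml", "-numeric", "-quiet", "-o", xhtml_path, html_path]


def saxon_cmd(saxon_jar, xhtml_path, xsl_path, out_path):
    """Build the second system call: apply the stylesheet with Saxon."""
    # Assumption: option syntax of current Saxon releases.
    return ["java", "-jar", saxon_jar,
            f"-s:{xhtml_path}", f"-xsl:{xsl_path}", f"-o:{out_path}"]


def extract(html_path, xsl_path, out_path, saxon_jar, xhtml_path="tidied.xhtml"):
    """Run tidy, then Saxon. Neither tool is invoked until this is called."""
    # tidy exits with 1 when it only issued warnings, which is normal
    # for messy HTML; 2 means errors it could not repair.
    status = subprocess.run(tidy_cmd(html_path, xhtml_path)).returncode
    if status > 1:
        raise RuntimeError(f"tidy could not repair {html_path}")
    subprocess.run(saxon_cmd(saxon_jar, xhtml_path, xsl_path, out_path), check=True)
```

A call would look like `extract("page.html", "links.xsl", "links.txt", "saxon.jar")` after writing `LINKS_XSL` to `links.xsl`. Keeping the command construction separate from the execution makes it easy to swap in another XSLT processor, or to replace either step with a native API later.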
Re: Re: Reverse engineering HTML
by THRAK (Monk) on Jun 14, 2001 at 21:06 UTC