in reply to Re: Importing into Database
in thread Importing into Database

Sidenote: if you're on a system that has Lynx installed, you can use it as a quick-and-dirty substitute for HTML::Parser. Using Perl's ability to open a filehandle on the output of a command, together with lynx's "-dump" switch, you can get a pre-parsed representation of the page as it would look on your console (i.e. as lynx would lay it out). This can be munged by normal means; if your HTML looks fairly simple when rendered*, this might be a win in terms of programming complexity.
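
To make that concrete, here's a minimal sketch of the pipe-open approach. It assumes lynx is on your PATH; the URL and the "Sales Rank" regex are just placeholders for whatever page and line you actually care about:

```perl
use strict;
use warnings;

# Run lynx -dump on a URL and return the rendered lines.
# The list form of the pipe-open avoids shell quoting issues.
sub dump_page {
    my ($url) = @_;
    open(my $lynx, '-|', 'lynx', '-dump', $url)
        or die "Can't run lynx: $!";
    my @lines = <$lynx>;
    close $lynx;
    return @lines;
}

# Scan the rendered text for the first line matching a
# hypothetical "Sales Rank: NNN" pattern and return the number.
sub find_rank {
    my (@lines) = @_;
    for my $line (@lines) {
        return $1 if $line =~ /Sales\s+Rank:\s*([\d,]+)/i;
    }
    return;    # undef if the pattern never shows up
}

# Usage (needs lynx installed and a live URL):
#   my @rendered = dump_page('http://www.example.com/book.html');
#   my $rank     = find_rank(@rendered);
```

Because lynx has already flattened the markup for you, the "parsing" step is just a line-oriented regex over plain text.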

As an anecdotal usage example, I used this approach at one point to write a "screen scraper" that pulled the Amazon sales ranks for tens of thousands of books and stuck them into a database for analysis. Amazon's HTML was fairly grotty, probably to discourage exactly this sort of automated digging, but the page still had to look simple to a human being. In the lynx-parsed output, the part I wanted boiled down to a single line that looked like "rank: foo", which was trivial to find and extract the information from.
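
Once lynx has rendered the page, that extraction really is a one-liner. A sketch, assuming (hypothetically) the dumped line reads something like "Sales Rank: 12,345" -- the exact wording on the real page may differ:

```perl
use strict;
use warnings;

# A made-up sample of one line of lynx -dump output.
my $rendered = "   Amazon.com Sales Rank: 12,345 in Books";

# Match the rank, then strip the thousands separators.
if ($rendered =~ /Sales\s+Rank:\s*([\d,]+)/i) {
    (my $rank = $1) =~ tr/,//d;
    print "rank: $rank\n";    # prints "rank: 12345"
}
```

The point is that you never touch the tag soup at all: by the time your regex runs, the markup is gone.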

HTH. :-)

* ... and the information you're interested in is actually rendered, as opposed to being buried in the tag structure somehow. If you care about what's in the tags, it's time to fire up the Beast that is HTML::Parser...

Replies are listed 'Best First'.
Re: HTML::Parser Alternative
by davorg (Chancellor) on Nov 24, 2001 at 14:12 UTC

    That sounds like a terrible idea to me. All you'll get back from lynx -dump is plain text; there will be no structure in it at all. I'd guess that can only make it much harder to parse out the data that you want.

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you don't talk about Perl club."