in reply to Parser for Html

I can second the use of HTML::TokeParser. A long time ago I wrote a newspaper crawler that pulled stories off newspaper websites using regexes, which is not the way to go. I rewrote the whole thing using HTML::TokeParser and it made a huge difference.

In its first incarnation, I had it down to 9 rules for parsing 24 papers. Using HTML::TokeParser, I was able to parse about 92 sites using 3 rules. Definitely the way to go. :)
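For anyone wondering what a "rule" might look like, here is a minimal sketch. It is not my original crawler code: the rule format (a tag name plus a class attribute), the `extract_headlines` helper, and the sample markup are all made up for illustration. Only `HTML::TokeParser->new`, `get_tag`, and `get_trimmed_text` are real module calls.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical rule format: { tag => ..., class => ... }.
# Collects the text of every matching element in the document.
sub extract_headlines {
    my ($html, $rule) = @_;

    # Passing a scalar reference makes the parser read from the string.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";

    my @found;
    # get_tag skips ahead to the next start tag with the given name.
    while (my $tag = $parser->get_tag($rule->{tag})) {
        my $attr = $tag->[1];    # hash ref of the tag's attributes
        next unless ($attr->{class} // '') eq $rule->{class};

        # Gather text up to the closing tag, with whitespace collapsed.
        push @found, $parser->get_trimmed_text("/$rule->{tag}");
    }
    return @found;
}

# Made-up sample page standing in for a newspaper front page.
my $html = <<'HTML';
<html><body>
<h2 class="headline">City council   approves budget</h2>
<h2 class="ad">Buy now</h2>
<h2 class="headline">Local team <b>wins</b> final</h2>
</body></html>
HTML

my @headlines = extract_headlines($html, { tag => 'h2', class => 'headline' });
print "$_\n" for @headlines;
```

The point is that one rule like this keeps working when a site shuffles whitespace, attribute order, or inline markup, which is exactly where regexes tend to fall over.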

Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.