in reply to Most efficient way to parse web pages

At work we've written a distributed web spider... basically it's a forking model that then gets thrown around on a MOSIX cluster... but anyways, I digress. What we've done is use the Parse::RecDescent module from CPAN and build up a grammar for parsing web pages. Then we describe a website in that metalanguage, and it generates an automaton that goes out, grabs the web page, and pulls out the important parts. Very flexible, very powerful, and we can parse millions of pages a day with it.
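The actual grammar and metalanguage aren't shown in the post, so here's a minimal sketch of the Parse::RecDescent side of the idea, assuming a toy grammar that just pulls anchor tags out of a page. The rule names, the sample HTML, and the grammar itself are illustrative only, not the spider's real grammar.

  use strict;
  use warnings;
  use Parse::RecDescent;

  # Toy grammar: walk a page, collect <a href="..."> links,
  # and let a catch-all rule swallow everything else.
  my $grammar = q{
      page  : chunk(s)
              { $return = [ grep { ref } @{ $item[1] } ] }

      chunk : link | other

      link  : /<a\s[^>]*href\s*=\s*"/ url '"' /[^>]*>/ text '</a>'
              { $return = { url => $item{url}, text => $item{text} } }

      url   : /[^"]+/
      text  : /[^<]*/

      # plain text or a stray '<' from some non-link tag
      other : /[^<]+/ { 1 }
            | /</     { 1 }
  };

  my $parser = Parse::RecDescent->new($grammar)
      or die "Bad grammar";

  my $html  = '<p>See <a href="http://example.com/">Example</a> for more.</p>';
  my $links = $parser->page($html);

  print "$_->{text} -> $_->{url}\n" for @{ $links || [] };

The nice part of the approach described above is that the grammar only has to be written once; the per-site work is just a description in the metalanguage, from which the link-chasing automaton is generated.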