in reply to Most efficient way to parse web pages
at work, we've written a distributed web spider... basically
it's a forking model that then gets thrown around on a
Mosix cluster... but anyway, I digress. What we've done
is use the Parse::RecDescent module from CPAN to build up
a grammar for parsing web pages. We then describe a
website in that metalanguage, and it generates an
automaton that goes out, grabs the web page, and extracts
the important parts. Very flexible, very powerful, and we
can parse millions of pages a day with it.
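Just to give a flavor of the approach: here's a minimal sketch of what a Parse::RecDescent setup like that might look like. The grammar below (the `page`/`title`/`junk` rules and the sample HTML) is purely illustrative — the real production grammar would describe a whole site, not just a title tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parse::RecDescent;

# Hypothetical grammar: pull the <title> text out of a page.
# Actions return the captured text; junk returns '' (not undef,
# since an undef action value makes a Parse::RecDescent rule fail).
my $grammar = q{
    page  : chunk(s) { [ grep { length } @{ $item[1] } ] }
    chunk : title | junk
    title : '<title>' /[^<]+/ '</title>' { $item[2] }
    junk  : /[^<]+|</ { '' }
};

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar";

my $html   = '<html><title>Example Page</title></html>';
my $titles = $parser->page($html);
print "$_\n" for @$titles;   # prints the extracted title(s)
```

You'd pair something like this with LWP to fetch the pages, and the grammar file becomes the per-site "description" that drives the automaton.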