At work, we've written a distributed web spider... basically
it's a forking model that then gets thrown around on a
MOSIX cluster... but anyways, I digress. What we've done
is use the Parse::RecDescent module from CPAN and build up
a grammar for parsing web pages. Then we describe
a website using the metalanguage described above, and it
generates an automaton that goes out, grabs the web page,
and extracts the important parts. Very flexible, very powerful,
and we can parse millions of pages a day with it.
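For a flavour of the approach, here's a minimal Parse::RecDescent sketch; the grammar, rule names, and URL below are just made up for illustration, not our actual metalanguage:

    use strict;
    use warnings;
    use Parse::RecDescent;
    use LWP::Simple qw(get);

    # Toy grammar: skip ahead to the <title> tag and capture its text.
    my $grammar = q{
        page    : prelude title            { $item{title} }
        prelude : /.*?(?=<title>)/s        # consume everything before <title>
        title   : '<title>' /[^<]+/ '</title>'  { $item[2] }
    };

    my $parser = Parse::RecDescent->new($grammar) or die "bad grammar\n";

    my $html  = get('http://www.example.com/') or die "fetch failed\n";
    my $title = $parser->page($html);          # start rule is a method call
    print defined $title ? "title: $title\n" : "no title found\n";

In the real thing the grammar is generated from the website description, but the mechanics are the same: one rule set per site, one automaton per page type.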
If you know exactly what you're looking for, and the pages come in no particular order (all pages equally important, not a tree; uncontrolled HTML you didn't write), *nothing* beats a forking LWP get, slurp, match, except multiple machines doing (fork, get, slurp, match) in parallel.
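Roughly, the (fork, get, slurp, match) pattern looks like this; the URL list, worker count, and pattern are invented for the sketch:

    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my @urls = map { "http://www.example.com/page$_.html" } 1 .. 20;
    my $kids = 5;                             # number of parallel workers

    # Deal the URLs out round-robin, one bucket per worker.
    my @bucket;
    push @{ $bucket[ $_ % $kids ] }, $urls[$_] for 0 .. $#urls;

    for my $worker (0 .. $kids - 1) {
        my $pid = fork;
        die "fork failed: $!\n" unless defined $pid;
        next if $pid;                         # parent keeps forking
        for my $url (@{ $bucket[$worker] }) {
            my $page = get($url) or next;     # get + slurp
            print "$url\t$1\n"                # match
                if $page =~ m{<title>([^<]*)</title>}i;
        }
        exit 0;                               # child is done
    }
    1 while wait != -1;                       # parent reaps the workers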
A parser has to handle all sorts of fat, sloppy, mixed-content, rarely correct HTML written by fools with Word, FrontPage, or Dreamweaver so that it looks good. I take it you're looking for something specific.
I've never looked at the Parse::RecDescent module, though; it must be doing something less involved than HTML::Parser. I'll look into it, thanks, but you'll need to write your own anyway. So: "keep it short and simple", "spread the work", "tune the fork(s)".
"tips:"
Slurp with $/ keyed on what you're looking for if it's likely to be near the beginning of the page, or set $/ to undef (and use m//g) if it's not, or if there are several occurrences in unpredictable places.
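For instance (the filename, marker, and patterns here are just placeholders):

    use strict;
    use warnings;

    {
        local $/ = '</title>';          # records end right after the title
        open my $fh, '<', 'page.html' or die "page.html: $!\n";
        my $chunk = <$fh>;              # first record usually holds the title
        print "title: $1\n" if defined $chunk && $chunk =~ m{<title>([^<]*)}i;
        close $fh;
    }

    {
        local $/;                       # undef: slurp the whole page at once
        open my $fh, '<', 'page.html' or die "page.html: $!\n";
        my $page = <$fh>;
        print "link: $1\n" while $page =~ m{<a\s+href="([^"]+)"}gi;  # m//g over it all
        close $fh;
    }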
(m//g & pos) is real quick and nestable.
Because multiple forks and multiple machines are asynchronous by nature, they make a mighty engine over TCP/IP once you spread the work across them.
Vane