Barring something useful like RSS feeds, you're going to have to do this on a site-by-site basis. Ideally, when your spider visits a site, it should load the rules for parsing that site; subclasses that override a &content method might be appropriate, as in the sketch below.
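Something like this rough, untested sketch is what I have in mind. The package names (Spider::Site, Spider::Site::SomeNewsSite) and the marker comments are made up for illustration:

    package Spider::Site;
    use strict;
    use warnings;

    sub new {
        my ( $class, %args ) = @_;
        return bless { html => $args{html} }, $class;
    }

    # Default behavior: no site-specific rules, so hand back the raw HTML.
    sub content {
        my $self = shift;
        return $self->{html};
    }

    package Spider::Site::SomeNewsSite;
    our @ISA = ('Spider::Site');

    # Site-specific rules: grab whatever sits between marker comments
    # that this particular site happens to use.
    sub content {
        my $self = shift;
        my ($story) = $self->{html} =~ m{<!-- begin story -->(.*?)<!-- end story -->}s;
        return $story;
    }

    1;

The spider itself stays dumb: it just picks the right subclass for the site it's on and calls content on it.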
Regrettably, I do a lot of work like this, and it's easier said than done. One thing that can help is looking for "printer friendly" links: those often lead to a page that strips away most of the extraneous information.
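For instance, here's a rough sketch of how you might hunt for those links with HTML::TokeParser. The regex is just a guess at the link text, so adjust it per site:

    use strict;
    use warnings;
    use HTML::TokeParser;

    my $html = do { local $/; <> };    # slurp the fetched page
    my $p    = HTML::TokeParser->new( \$html );

    # Walk the anchor tags and print hrefs whose link text
    # looks like a "printer friendly" variant.
    while ( my $token = $p->get_tag('a') ) {
        my $href = $token->[1]{href} || '';
        my $text = $p->get_trimmed_text('/a');
        print "$href\n" if $text =~ /printer[\s-]*friendly/i;
    }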
Cheers,
Ovid
New address of my CGI Course.