PerlMonks |
The problem is that this is going to leave a lot of "non-content" data, such as menu link names, possible advertising text, etc. While it's a very poor guide, HTML can serve as "metadata" that allows you to navigate to the actual content. Remove it before you get to the content and the spider won't be able to make intelligent decisions.

Cheers,
New address of my CGI Course.

In reply to Re^2: How would you extract *content* from websites? by Ovid
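
To illustrate the point about markup acting as "metadata", here is a minimal sketch in Python (standard library only). It compares stripping every tag against using the tags to skip regions that are probably not content. The `SKIP_TAGS` set, the `SAMPLE` page, and the helper names are all hypothetical choices made for this example. Real sites mark up their menus and ads in many different ways, so treat this as a rough heuristic and not a general solution.

```python
from html.parser import HTMLParser

# Assumption: these tags usually wrap non-content. Real pages vary widely.
SKIP_TAGS = frozenset({"nav", "header", "footer", "aside", "script", "style"})


class TextCollector(HTMLParser):
    """Collect text runs, ignoring anything nested inside a skipped tag."""

    def __init__(self, skip_tags=frozenset()):
        super().__init__()
        self.skip_tags = skip_tags
        self.skip_depth = 0   # > 0 while we are inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html, skip_tags=frozenset()):
    """Return the non-empty text runs of `html`, minus skipped regions."""
    parser = TextCollector(skip_tags)
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page: a menu, one paragraph of content, and an ad-like footer.
SAMPLE = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<div id="main"><p>The actual article text.</p></div>
<footer>Buy our stuff!</footer>
</body></html>
"""

if __name__ == "__main__":
    # Tags stripped blindly: menu links and ad text are mixed into the result.
    print(extract_text(SAMPLE))
    # Tags used as a guide: only text outside the skipped regions survives.
    print(extract_text(SAMPLE, SKIP_TAGS))
```

With an empty skip set the collector behaves like a plain tag stripper, so the menu link names and the footer text come back alongside the article. Once the tags are gone there is nothing left to tell those pieces apart, which is the loss of information the post describes.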