This is a problem I've thought a lot about and written many programs for, on a site-by-site basis. Although I haven't really come up with a good general solution (and there probably isn't one), I currently scrape websites looking for images, which, depending on how you look at it, is either a considerably harder or an easier problem. Basically, to decide which image on a given page is the "most interesting", I look at the filename and host (to see whether they match the site's), the size (filtering out common ad sizes), and the placement on the page (in my experience, on a page that has only one "useful" image, it's likely to be near the end, since all the ads are up front). A rough sketch of these heuristics is below.
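Here's a minimal sketch of how I'd score images along those three axes. The weights, the size cutoffs, and the list of "common ad sizes" are made-up placeholders, not a tested ruleset, and it assumes HTML::TreeBuilder and URI are available:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use URI;

    # Standard banner dimensions to filter out (placeholder list).
    my %AD_SIZES = map { $_ => 1 } qw(468x60 728x90 300x250 120x600 160x600);

    # Returns image URLs sorted from most to least "interesting".
    sub score_images {
        my ($html, $page_url) = @_;
        my $host = URI->new($page_url)->host;
        (my $host_stem = $host) =~ s/^www\.//;

        my $tree = HTML::TreeBuilder->new_from_content($html);
        my @imgs = $tree->look_down(_tag => 'img');
        my @scored;

        for my $i (0 .. $#imgs) {
            my $img = $imgs[$i];
            my $src = $img->attr('src') or next;
            my $score = 0;

            # Filename/host match: images served from (or named after) the
            # site itself are more likely content than third-party ads.
            my $img_host = eval { URI->new_abs($src, $page_url)->host } // '';
            $score += 2 if $img_host =~ /\Q$host_stem\E/i;
            $score += 1 if $src      =~ /\Q$host_stem\E/i;

            # Size: skip anything matching a common ad size; favor big images.
            my ($w, $h) = ($img->attr('width'), $img->attr('height'));
            if (defined $w && defined $h) {
                next if $AD_SIZES{"${w}x${h}"};
                $score += 1 if $w >= 200 && $h >= 200;
            }

            # Placement: later on the page beats earlier (ads cluster up front).
            $score += $i / @imgs;

            push @scored, [$score, $src];
        }

        $tree->delete;
        return map { $_->[1] } sort { $b->[0] <=> $a->[0] } @scored;
    }

In practice you'd tune the weights per site, which is exactly why this ends up being a site-by-site job.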