Re: How would you extract *content* from websites?

by kaif (Friar)

on Jun 21, 2005 at 07:52 UTC

in reply to How would you extract *content* from websites?

This is a problem I've thought a lot about, and I've written many programs to solve it on a site-by-site basis. Although I haven't really come up with a good solution (and there probably isn't one), I currently scrape websites looking for images. Depending on how you look at it, this can be considerably harder or considerably easier than extracting text. Basically, to decide which image on a given page is the "most interesting", I look at the filename and host (to see whether the host matches the page's), the size (filtering out common ad sizes), and the placement on the page (in my experience, on a page that has only one "useful" image, it's likely to be at the end, since all the ads are up front).
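
To make the heuristic concrete, here is a minimal sketch in Python. It is not the code I actually run: the `AD_SIZES` list, the `JUNK_WORDS` list, the scoring weights, and the function names `score_image` and `pick_image` are all assumptions made up for illustration, and it expects the images to have been extracted already as dicts with `src`, `width`, and `height` keys, in page order.

```python
from urllib.parse import urlparse
import posixpath

# Assumed list of common banner/button dimensions; tune it for the sites you scrape.
AD_SIZES = {(468, 60), (728, 90), (300, 250), (160, 600), (120, 600), (88, 31), (125, 125)}

# Assumed filename fragments that suggest ads or page furniture.
JUNK_WORDS = ("banner", "logo", "spacer", "button", "sponsor")


def score_image(img, page_host, index, total):
    """Return a heuristic score (higher is better), or None if the image is filtered out."""
    # Size: drop anything with a common ad size outright.
    if (img.get("width"), img.get("height")) in AD_SIZES:
        return None

    score = 0.0
    parsed = urlparse(img["src"])

    # Host: a relative URL or a host matching the page is less likely to be an ad.
    if parsed.hostname is None or parsed.hostname == page_host:
        score += 1.0

    # Filename: penalize names that look like ads or decoration.
    name = posixpath.basename(parsed.path).lower()
    if any(word in name for word in JUNK_WORDS):
        score -= 1.0

    # Placement: images later on the page score higher (0.0 for first, 1.0 for last).
    if total > 1:
        score += index / (total - 1)

    return score


def pick_image(images, page_url):
    """Return the highest-scoring image dict, or None if every image is filtered out."""
    page_host = urlparse(page_url).hostname
    best, best_score = None, None
    for index, img in enumerate(images):
        score = score_image(img, page_host, index, len(images))
        if score is not None and (best_score is None or score > best_score):
            best, best_score = img, score
    return best


if __name__ == "__main__":
    page = [
        {"src": "http://ads.example.net/banner.gif", "width": 468, "height": 60},
        {"src": "/img/logo.png", "width": 100, "height": 40},
        {"src": "http://www.example.com/photos/cat.jpg", "width": 640, "height": 480},
    ]
    print(pick_image(page, "http://www.example.com/page.html")["src"])
```

The weights are equal here only to keep the sketch short. In practice each site tends to need its own tuning, which is why I end up doing this on a site-by-site basis.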

In Section Meditations