
Re: How would you extract *content* from websites?

by kaif (Friar)
on Jun 21, 2005 at 07:52 UTC


in reply to How would you extract *content* from websites?

This is a problem I've thought a lot about, and I've written many programs to handle it on a site-by-site basis. Although I haven't really come up with a good solution (and there probably isn't one), I currently scrape websites looking for images. Depending on how you look at it, this can be a considerably harder or easier problem. Basically, to decide which image on a given page is the "most interesting", I look at the filename and host (to see if they match), the size (filtering out common ad sizes), and the placement on the page. In my experience, on a page that has only one "useful" image, it's likely to be at the end, since all the ads are up front.
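To make the three heuristics concrete, here is a rough sketch in Python; it is not the code I actually run. It assumes the `<img>` tags have already been pulled out of the page, in document order, into dicts with `src`, `width` and `height` keys. The names `AD_SIZES`, `JUNK_WORDS`, `score_image` and `pick_image`, the weights, and the list of ad sizes are all made up for illustration. It also reads "host match" as "the image is served from the same host as the page", which is only one possible interpretation.

```python
from urllib.parse import urlparse

# Assumed set of common ad/button dimensions (width, height); tune per site.
AD_SIZES = {(468, 60), (728, 90), (300, 250), (160, 600), (120, 600), (88, 31)}

# Assumed filename fragments that usually mean "not content".
JUNK_WORDS = ("banner", "spacer", "logo", "button")


def score_image(page_url, img, index, total):
    """Heuristic score for one image; higher means more likely to be content."""
    score = 0.0

    # Host: an image served from the page's own host is less likely to be an ad.
    page_host = urlparse(page_url).hostname or ""
    img_host = urlparse(img["src"]).hostname or page_host  # relative URL: same host
    if img_host == page_host:
        score += 1.0

    # Filename: penalise names that look like page furniture.
    filename = urlparse(img["src"]).path.rsplit("/", 1)[-1].lower()
    if any(word in filename for word in JUNK_WORDS):
        score -= 1.0

    # Size: filter out common ad sizes and very small images.
    width, height = img.get("width"), img.get("height")
    if width and height:
        if (width, height) in AD_SIZES:
            score -= 2.0
        elif width * height < 10000:
            score -= 1.0

    # Placement: later images score higher, up to +1 for the last one.
    score += index / max(total - 1, 1)
    return score


def pick_image(page_url, images):
    """Return the highest-scoring image dict, or None if there are no images."""
    if not images:
        return None
    total = len(images)
    scored = [(score_image(page_url, img, i, total), img) for i, img in enumerate(images)]
    return max(scored, key=lambda pair: pair[0])[1]
```

Calling `pick_image(page_url, images)` on the extracted list returns the best guess. The weights are arbitrary, and in practice each site tends to need its own tweaks, which is why I still end up doing this site by site.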
