PerlMonks  

Re: How would you extract *content* from websites?

by Your Mother (Archbishop)
on Jun 17, 2005 at 19:23 UTC


in reply to How would you extract *content* from websites?

The diff approach is error-prone on lots of sites because ads are randomized and menus often change from page to page, even if only by a single link. Ovid made some good points. Another thing I've relied on when doing this kind of thing is that content has entirely different semantics from navigation and junk.

An article is made of sentences, and not just one or two but a dozen or more. Ads and navigation will rarely be complete sentences, and never more than one or two. I had pretty good success with this strategy building a news/story fetcher three years ago for sites without RSS: plain text --> lines --> filter out everything but contiguous blocks of sentences --> choose the largest remaining item.
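
The pipeline above could be sketched roughly like this. It is not the original fetcher, just a minimal illustration in Python. The sentence test (capital letter, at least five words, terminal punctuation), the `MIN_WORDS` threshold, the choice to let blank lines pass without breaking a block, and the names `count_sentences` and `extract_content` are all my assumptions.

```python
import re

# Assumption: a "sentence" starts with a capital letter, runs to the first
# '.', '!' or '?', and that punctuation is followed by whitespace or end of line.
SENTENCE_RE = re.compile(r"[A-Z][^.!?]*[.!?](?=\s|$)")
MIN_WORDS = 5  # hypothetical threshold; rejects short blurbs like "Buy now!"


def count_sentences(line):
    """Count sentence-like chunks in a line that have at least MIN_WORDS words."""
    return sum(1 for s in SENTENCE_RE.findall(line) if len(s.split()) >= MIN_WORDS)


def extract_content(text):
    """Return the largest contiguous block of sentence-bearing lines, or ''."""
    blocks, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines do not end a block
        if count_sentences(line) >= 1:
            current.append(line)
        elif current:
            # a non-sentence line (menu, ad, footer) closes the current block
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    if not blocks:
        return ""
    # "largest" is measured here by total character count
    best = max(blocks, key=lambda block: sum(len(l) for l in block))
    return "\n".join(best)
```

Feed it the page after stripping the HTML to plain text. A stray one-sentence ad still forms a block of its own, but it loses to the article in the final size comparison, which is the whole point of choosing the largest block.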
