Re: How would you extract *content* from websites?

by TedPride (Priest)
on Jun 17, 2005 at 18:21 UTC


in reply to How would you extract *content* from websites?

Remove everything up to and including the opening BODY tag. Strip all remaining tags, replacing images with their alt text. Then compare the start and end of each page with those of every other page. Remove material that is shared by at least x pages and is more than y words long (or some combination of the two thresholds). This will be the header and footer material.
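
To make the steps above concrete, here is a rough sketch in Python rather than Perl. It is only one possible reading of the idea, not a tested production approach. The names `page_words`, `strip_shared_edges`, `min_pages` and `min_words` are invented for this illustration, and comparing pages word by word from each end is an assumption about how "the start and end of each page" would be matched.

```python
from html.parser import HTMLParser


class _BodyText(HTMLParser):
    """Collect the words inside BODY, using alt text in place of images."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "img" and self.in_body:
            alt = dict(attrs).get("alt")
            if alt:
                self.words.extend(alt.split())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.words.extend(data.split())


def page_words(html):
    """Return the visible words of one page as a list."""
    parser = _BodyText()
    parser.feed(html)
    parser.close()
    return parser.words


def _shared_len(a, b):
    """Number of leading items that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def _edge_len(i, pages, min_pages, min_words):
    """Longest prefix of pages[i] shared with at least min_pages - 1 others."""
    lens = sorted(
        (_shared_len(pages[i], p) for j, p in enumerate(pages) if j != i),
        reverse=True,
    )
    need = min_pages - 1
    if need < 1 or len(lens) < need:
        return 0
    n = lens[need - 1]
    return n if n > min_words else 0


def strip_shared_edges(pages, min_pages=2, min_words=3):
    """Drop header/footer words shared across pages (hypothetical thresholds)."""
    reversed_pages = [p[::-1] for p in pages]
    result = []
    for i, words in enumerate(pages):
        head = _edge_len(i, pages, min_pages, min_words)
        tail = _edge_len(i, reversed_pages, min_pages, min_words)
        tail = min(tail, len(words) - head)  # never let the two ends overlap
        result.append(words[head:len(words) - tail])
    return result
```

Calling `strip_shared_edges([page_words(h) for h in html_pages])` returns one word list per page with the shared leading and trailing material removed. Both thresholds would need tuning for a real site.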

What's left is the classic "longest substrings common between two pieces of text" problem. There was a discussion of that recently - let me see if I can find the thread...
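
For reference, the textbook dynamic-programming answer to the longest common substring problem looks roughly like the sketch below (in Python, not Perl). This is not taken from the thread mentioned above. The function name `longest_common_run` is made up for this example, and it finds only the single longest shared run between two sequences, so finding several shared substrings would mean calling it repeatedly on the leftovers.

```python
def longest_common_run(a, b):
    """Return the longest contiguous run that appears in both a and b.

    Works on any two sliceable sequences of the same kind, for example
    two lists of words or two strings. Runs in O(len(a) * len(b)) time.
    """
    best_len = 0
    best_end = 0  # index in a just past the end of the best run
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous pair of items.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_end = i
        prev = cur
    return a[best_end - best_len:best_end]
```

Passing word lists instead of raw strings keeps the match aligned on word boundaries, which fits the word-count threshold used for the header and footer step.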
