PerlMonks  

Re^2: How would you extract *content* from websites?

by Ovid (Cardinal)
on Jun 17, 2005 at 18:31 UTC ( [id://467836]=note )


in reply to Re: How would you extract *content* from websites?
in thread How would you extract *content* from websites?

The problem is that this will leave behind a lot of "non-content" data, such as menu link names and possibly advertising text. While it's a very poor guide, HTML can serve as "metadata" that lets you navigate to the actual content. Remove it before you have found the content and the spider won't be able to make intelligent decisions.
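
To make that concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. The set of tags treated as "non-content" (`SKIP_TAGS`) and the names `TextCollector` and `extract_text` are my own assumptions for illustration; real sites vary a lot, so treat this as a starting point and not a general solution.

```python
from html.parser import HTMLParser

# Assumption: these tags usually wrap menus, ads, and other non-content.
# Tune this set for the sites you actually crawl.
SKIP_TAGS = frozenset({"nav", "header", "footer", "aside", "script", "style"})


class TextCollector(HTMLParser):
    """Collect text, ignoring anything nested inside a tag in skip_tags."""

    def __init__(self, skip_tags=SKIP_TAGS):
        super().__init__()
        self.skip_tags = skip_tags
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html, skip_tags=SKIP_TAGS):
    """Return the page text, using the markup to decide what to drop."""
    parser = TextCollector(skip_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    '<nav><a href="/">Home</a> <a href="/ads">Buy now</a></nav>'
    '<div id="main"><p>The actual article text.</p></div>'
    "<footer>Copyright</footer>"
)

# Using the tags as a guide keeps only the article text.
print(extract_text(page))

# Ignoring the markup (empty skip set) mixes menu and footer text back in.
print(extract_text(page, skip_tags=frozenset()))
```

The second call is the "strip everything first" approach: once the tags are gone, nothing distinguishes "Buy now" from the article itself.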

Cheers,
Ovid

New address of my CGI Course.
