PerlMonks  

Re^3: How to extract links from a webpage and store them in a mysql database

by g0n (Priest)
on Dec 06, 2006 at 12:36 UTC ( [id://588081]=note )


in reply to Re^2: How to extract links from a webpage and store them in a mysql database
in thread How to extract links from a webpage and store them in a mysql database

Step one is probably to work out an algorithm for what you want. Something like this, perhaps:

  • Create your database table with columns for 'link', 'depth' and 'read'
  • Read the first page and store its base URL
  • For each link in the page, compare its base to the original base URL
  • If they match, add the link to the DB with depth 2 and read 'no'
  • For each entry in the table where read eq 'no', read the page, set read to 'yes', and compare the base of each link to the original base URL
  • If they match, add the link to the DB with depth 3 and read 'no'
  • Repeat the last two steps, setting depth to 4 (i.e. a link found on a depth 3 page)
  • End
You could end when you don't find any entries in the DB with depth <= 3 and read eq 'no'; that way it's easy to modify if you decide to read deeper.
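
For what it's worth, here is a rough sketch of the steps above. It is written in Python with the standard-library sqlite3 module standing in for Perl and MySQL, purely so it runs without a database server; the same loop should carry over to DBI fairly directly. The names (links table, crawl, fetch, same_base, MAX_DEPTH) are all made up for illustration. Two assumptions of mine: the start page is stored in the table at depth 1 so a single loop handles every level, and "base" is taken to mean the host part of the URL.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit

MAX_DEPTH = 3  # pages up to this depth are read; raise it to crawl deeper


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, page_url):
    """Return absolute URLs (fragments stripped) for the links in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(page_url, h))[0] for h in parser.hrefs]


def same_base(url, base_url):
    """Assumption: two URLs share a base when their host parts match."""
    return urlsplit(url).netloc == urlsplit(base_url).netloc


def crawl(start_url, fetch, db):
    """fetch is any callable taking a URL and returning the page's HTML."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS links "
        "(link TEXT PRIMARY KEY, depth INTEGER, read TEXT)"
    )
    # Assumption: the start page itself goes in at depth 1.
    db.execute("INSERT OR IGNORE INTO links VALUES (?, 1, 'no')", (start_url,))
    while True:
        row = db.execute(
            "SELECT link, depth FROM links "
            "WHERE read = 'no' AND depth <= ? ORDER BY depth LIMIT 1",
            (MAX_DEPTH,),
        ).fetchone()
        if row is None:
            break  # nothing unread at depth <= MAX_DEPTH: we are done
        link, depth = row
        db.execute("UPDATE links SET read = 'yes' WHERE link = ?", (link,))
        for found in extract_links(fetch(link), link):
            if same_base(found, start_url):
                # The primary key makes a link we already hold a no-op.
                db.execute(
                    "INSERT OR IGNORE INTO links VALUES (?, ?, 'no')",
                    (found, depth + 1),
                )
    db.commit()
```

To try it without touching the network, pass a fetch that looks pages up in a dict; for real use, something built on urllib.request.urlopen would do. Links found on depth 3 pages are stored at depth 4 but never read, which matches the stopping rule above.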

--------------------------------------------------------------

"If there is such a phenomenon as absolute evil, it consists in treating another human being as a thing."
John Brunner, "The Shockwave Rider".
