in reply to Re^3: crawling one website
in thread crawling one website

Viewing the source of http://www.senopt.com shows "links" like:

<a href="senopt/VPS/vps.html" target="_blank">vps</a> <br> <a href="senopt/DogTraining/dogtraining.html" target="_blank">dog training</a>

You're going to have to do a bit more work, I'm afraid. You may want to start by reading More robust link finding than HTML::LinkExtor/HTML::Parser?.
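
To illustrate the extra work, here is a minimal sketch (untested; the start URL and the restriction to <a> tags are just assumptions for the example) that fetches a page and resolves relative hrefs like the ones above against its base URL, using LWP::UserAgent, HTML::LinkExtor and URI:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $url = 'http://www.senopt.com/';            # page to scan
my $res = LWP::UserAgent->new->get($url);
die "Can't fetch $url: ", $res->status_line unless $res->is_success;

my @links;
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    return unless $tag eq 'a' and $attr{href};
    # turn relative hrefs like "senopt/VPS/vps.html" into absolute URLs
    push @links, URI->new_abs($attr{href}, $res->base)->as_string;
});
$extor->parse($res->decoded_content);

print "$_\n" for @links;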

HTH,

planetscape

Re^5: crawling one website
by vit (Friar) on May 29, 2011 at 00:56 UTC
    Are you saying that senopt.com does not have real links?

    I tried
    perl hgrepurl.pl http://www.txtlinks.com/
    and the results are quite encouraging.
    txtlinks is a web directory, and the program first returns a chunk of URL-encoded stuff, like:
    ...http://www.txtlinks.com/%3Chtml%3E%0A%3Chead%3E%0A%20%20%3Ctitle%3ETXT%20Links%20Pure%20Links%20Di.......
    without line breaks. Then it returns a bunch of normal links from the site.
    What is the first portion?
    Is it possible to get all the real links in a site, at all depth levels, with this program?
      Are you saying that senopt.com does not have real links?
      ...
      Is it possible to get all the real links in a site, at all depth levels, with this program?

      How do you define "real links"?

      What did your reading of More robust link finding than HTML::LinkExtor/HTML::Parser? suggest?

      HTH,

      planetscape
        By real links I mean full links starting with http://..., not relative links to sub-directories.
        The program you recommended seems to be what I need. It looks like it retrieves all "real" links from a web page, but it does not walk the whole domain tree. So, in order to get all links starting from the root, I could use some program (say WWW::Sitemap) that retrieves URLs at all depth levels, and then run hgrepurl.pl on each of those pages to get all links from them.
        Am I right?
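
        For illustration, both steps could also be rolled into one script; here is a rough sketch of a breadth-first crawl with WWW::Mechanize that only follows links on the same host (the start URL is just a placeholder, and robots.txt / politeness handling is left out):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use WWW::Mechanize;
        use URI;

        my $start = 'http://www.senopt.com/';   # placeholder starting URL
        my $host  = URI->new($start)->host;

        my $mech  = WWW::Mechanize->new( autocheck => 0 );
        my %seen  = ( $start => 1 );
        my @queue = ($start);

        while ( my $url = shift @queue ) {
            $mech->get($url);
            next unless $mech->success and $mech->is_html;

            for my $link ( $mech->links ) {
                my $abs = $link->url_abs;                   # absolute URI object
                next unless $abs->scheme =~ /^https?$/;     # skip mailto:, javascript:, ...
                ( my $key = $abs->as_string ) =~ s/#.*//;   # ignore fragments
                next if $seen{$key}++;
                print "$key\n";
                # descend only into pages on the same host
                push @queue, $key if $abs->host eq $host;
            }
        }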