in reply to Re^4: crawling one website
in thread crawling one website

Do you mean to say that senopt.com does not have real links?

I tried
perl hgrepurl.pl http://www.txtlinks.com/
and the results are quite encouraging. txtlinks is a web directory, and the program first returns a chunk of URL-encoded output like this:
...http://www.txtlinks.com/%3Chtml%3E%0A%3Chead%3E%0A%20%20%3Ctitle%3ETXT%20Links%20Pure%20Links%20Di.......
with no line breaks. After that it returns a bunch of normal links from the site.
What is this first portion?
Is it possible to get all the real links in a site, at all depth levels, with this program?
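
By the way, the %3Chtml%3E prefix in that encoded portion suggests it is just percent-encoded HTML, apparently the page's own source, tacked onto the base URL. One way to check is to decode the escapes with URI::Escape; in this untested snippet, $blob is only the beginning of the output above:

    use strict;
    use warnings;
    use URI::Escape qw(uri_unescape);

    # $blob is only the beginning of the encoded portion shown above
    my $blob = '%3Chtml%3E%0A%3Chead%3E%0A%20%20%3Ctitle%3ETXT%20Links';
    print uri_unescape($blob), "\n";
    # prints:
    # <html>
    # <head>
    #   <title>TXT Links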

Re^6: crawling one website
by planetscape (Chancellor) on May 29, 2011 at 02:07 UTC
    Do you mean to say that senopt.com does not have real links?
    ...
    Is it possible to get all the real links in a site, at all depth levels, with this program?

    How do you define "real links"?

    What did your reading of More robust link finding than HTML::LinkExtor/HTML::Parser? suggest?
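
    (HTML::LinkExtor, named in that title, is the usual starting point; an untested minimal sketch of extracting one page's links with it, using your txtlinks URL, might look like this:)

        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTML::LinkExtor;
        use URI;

        my $base = 'http://www.txtlinks.com/';
        my $res  = LWP::UserAgent->new->get($base);
        die $res->status_line unless $res->is_success;

        # collect the href of every <a> tag via a parser callback
        my @urls;
        my $p = HTML::LinkExtor->new( sub {
            my ( $tag, %attr ) = @_;
            push @urls, $attr{href} if $tag eq 'a' and $attr{href};
        } );
        $p->parse( $res->decoded_content );

        # resolve relative hrefs against the page URL before printing
        print URI->new_abs( $_, $base ), "\n" for @urls;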

    HTH,

    planetscape
      By "real links" I mean full links starting with http://..., not links to sub-directories.
      The program you recommended seems to be what I need. It looks like it retrieves all "real" links from a web page, but it does not walk the whole domain tree. So, in order to get all links starting from the root, I could use some module (say, WWW::Sitemap) that retrieves the URLs at all depth levels, and then run hgrepurl.pl on each of those pages to get all the links from there.
      Am I right?
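
      Here is a rough, untested sketch of what I mean, though it combines both steps in one loop and uses WWW::Mechanize rather than WWW::Sitemap (the module choice is just an assumption on my part; a real crawler would also need politeness delays, robots.txt handling, and a depth limit):

      use strict;
      use warnings;
      use WWW::Mechanize;
      use URI;

      my $root = URI->new('http://www.txtlinks.com/');
      my $mech = WWW::Mechanize->new( autocheck => 0 );

      my ( %seen, %found );
      my @queue = ( $root->as_string );

      while ( my $url = shift @queue ) {
          next if $seen{$url}++;                  # fetch each page once
          my $res = $mech->get($url);
          next unless $res->is_success && $mech->is_html;

          for my $link ( $mech->links ) {
              next unless defined $link->url;
              my $abs = URI->new_abs( $link->url, $url );
              next unless $abs->scheme && $abs->scheme =~ /^https?\z/;
              $abs->fragment(undef);              # drop #fragments for dedup
              $found{ $abs->as_string }++;        # record every absolute link
              push @queue, $abs->as_string
                  if $abs->host eq $root->host;   # same site: crawl deeper
          }
      }

      print "$_\n" for sort keys %found;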