in reply to Re^3: crawling one website
in thread crawling one website

Viewing the source of http://www.senopt.com shows "links" like:

<a href="senopt/VPS/vps.html" target="_blank">vps</a> <br> <a href="senopt/DogTraining/dogtraining.html" target="_blank">dog training</a>

You're going to have to do a bit more work, I'm afraid. You may want to start by reading More robust link finding than HTML::LinkExtor/HTML::Parser?.
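
To illustrate the extra work, here is a minimal sketch (untested; the start URL and the restriction to <a> tags are just assumptions for the example) that fetches a page and resolves relative hrefs like the ones above against its base URL, using LWP::UserAgent, HTML::LinkExtor and URI:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $url = 'http://www.senopt.com/';            # page to scan
my $res = LWP::UserAgent->new->get($url);
die "Can't fetch $url: ", $res->status_line unless $res->is_success;

my @links;
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    return unless $tag eq 'a' and $attr{href};
    # turn relative hrefs like "senopt/VPS/vps.html" into absolute URLs
    push @links, URI->new_abs($attr{href}, $res->base)->as_string;
});
$extor->parse($res->decoded_content);

print "$_\n" for @links;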

HTH,

planetscape

Re^5: crawling one website
by vit (Friar) on May 29, 2011 at 00:56 UTC
    Are you saying that senopt.com does not have real links?

    I tried
    perl hgrepurl.pl http://www.txtlinks.com/
    and the results are quite encouraging.
    txtlinks is a web directory, and the program first returns a chunk of URL-encoded stuff, like:
    ...http://www.txtlinks.com/%3Chtml%3E%0A%3Chead%3E%0A%20%20%3Ctitle%3ETXT%20Links%20Pure%20Links%20Di.......
    without line breaks. Then it returns a bunch of normal links from the site.
    What is the first portion?
    Is it possible to get all the real links in a site, at all depth levels, with this program?
      Are you saying that senopt.com does not have real links?
      ...
      Is it possible to get all the real links in a site, at all depth levels, with this program?

      How do you define "real links"?

      What did your reading of More robust link finding than HTML::LinkExtor/HTML::Parser? suggest?

      HTH,

      planetscape
        By real links I mean full links starting with http://..., not relative links to sub-directories.
        The program you recommended seems to be what I need. It looks like it retrieves all "real" links from a web page, but it does not walk the whole domain tree. So, in order to get all links starting from the root, I could use some program (say WWW::Sitemap) that retrieves URLs at all depth levels, and then run hgrepurl.pl on each of those pages to get all links from them.
        Am I right?
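
        For illustration, both steps could also be rolled into one script; here is a rough sketch of a breadth-first crawl with WWW::Mechanize that only follows links on the same host (the start URL is just a placeholder, and robots.txt / politeness handling is left out):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use WWW::Mechanize;
        use URI;

        my $start = 'http://www.senopt.com/';   # placeholder starting URL
        my $host  = URI->new($start)->host;

        my $mech  = WWW::Mechanize->new( autocheck => 0 );
        my %seen  = ( $start => 1 );
        my @queue = ($start);

        while ( my $url = shift @queue ) {
            $mech->get($url);
            next unless $mech->success and $mech->is_html;

            for my $link ( $mech->links ) {
                my $abs = $link->url_abs;                   # absolute URI object
                next unless $abs->scheme =~ /^https?$/;     # skip mailto:, javascript:, ...
                ( my $key = $abs->as_string ) =~ s/#.*//;   # ignore fragments
                next if $seen{$key}++;
                print "$key\n";
                # descend only into pages on the same host
                push @queue, $key if $abs->host eq $host;
            }
        }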