in reply to Re: crawling one website
in thread crawling one website

Thanks!
I tried hgrepurl.pl with and without parameters, but it does not print anything.
Could you give me a usage example without a proxy?

Replies are listed 'Best First'.
Re^3: crawling one website
by vit (Friar) on May 28, 2011 at 14:50 UTC
    Actually I got:
    perl -W hgrepurl.pl http://www.senopt.com
    Subroutine Cwd::fastcwd redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
    Subroutine Cwd::getcwd redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
    Subroutine Cwd::abs_path redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
    main::get_html() called too early to check prototype at hgrepurl.pl line 27.
    Is it serious? What should I do?
    I also checked:
    perl -c hgrepurl.pl http://www.senopt.com
    hgrepurl.pl syntax OK

      Viewing the source of http://www.senopt.com shows "links" like:

      <a href="senopt/VPS/vps.html" target="_blank">vps</a> <br>
      <a href="senopt/DogTraining/dogtraining.html" target="_blank">dog training</a>

      You're going to have to do a bit more work, I'm afraid. You may want to start by reading More robust link finding than HTML::LinkExtor/HTML::Parser?.
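      One way to handle such relative hrefs is to resolve them against the page's base URL before printing them. A minimal sketch (assuming the LWP::UserAgent, HTML::LinkExtor, and URI modules from CPAN are installed; the starting URL is just an example):

      ```perl
      #!/usr/bin/perl
      use strict;
      use warnings;
      use LWP::UserAgent;
      use HTML::LinkExtor;
      use URI;

      my $base = shift || 'http://www.senopt.com/';
      my $ua   = LWP::UserAgent->new;
      my $resp = $ua->get($base);
      die "GET $base failed: ", $resp->status_line unless $resp->is_success;

      my @links;
      my $parser = HTML::LinkExtor->new(sub {
          my ($tag, %attr) = @_;
          return unless $tag eq 'a' && $attr{href};
          # Resolve relative hrefs like "senopt/VPS/vps.html"
          # against the base URL of the response
          push @links, URI->new_abs($attr{href}, $resp->base)->as_string;
      });
      $parser->parse($resp->decoded_content);

      print "$_\n" for @links;
      ```

      The key call is URI->new_abs, which turns "senopt/VPS/vps.html" into a full http://... URL; without it you only see the relative paths.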

      HTH,

      planetscape
        Are you saying that senopt.com does not have real links?

        I tried
        perl hgrepurl.pl http://www.txtlinks.com/
        which is quite encouraging.
        txtlinks is a web directory, and it first returns a portion of URL-encoded stuff, like:
        ...http://www.txtlinks.com/%3Chtml%3E%0A%3Chead%3E%0A%20%20%3Ctitle%3E +TXT%20Links%20Pure%20Links%20Di.......
        without line breaks. Then it returns a bunch of normal links from this site.
        What is the first portion?
        Is it possible to get all real links residing on a site, at all depth levels, with this program?
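
        For crawling a whole site rather than a single page, one possible approach is a small queue-based loop restricted to the starting host. A sketch only (assuming WWW::Mechanize and URI are installed; hgrepurl.pl itself appears to extract links only from the pages you name on the command line):

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;
        use WWW::Mechanize;
        use URI;

        my $start = shift || 'http://www.senopt.com/';
        my $host  = URI->new($start)->host;

        my $mech = WWW::Mechanize->new(autocheck => 0);
        my (%seen, @queue);
        @queue = ($start);

        while (my $url = shift @queue) {
            next if $seen{$url}++;
            my $resp = $mech->get($url);
            next unless $resp->is_success && $mech->is_html;
            print "$url\n";
            for my $link ($mech->links) {
                my $abs = URI->new_abs($link->url, $mech->base);
                $abs->fragment(undef);    # drop #anchors so we don't revisit pages
                # stay on the same host so the crawl doesn't wander off-site
                push @queue, $abs->as_string
                    if $abs->scheme =~ /^https?$/ && $abs->host eq $host;
            }
        }
        ```

        Note there is no depth limit or politeness delay here; for anything beyond a quick experiment you would want both, or a ready-made spider such as WWW::Mechanize's relatives on CPAN.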