in reply to crawling one website

hgrepurl.pl, from the book Web Client Programming with Perl (full text freely available from the O'Reilly Open Books Project), can get you your links. w3mir, from the CPAN, can fetch a list of links and lets you control the depth of recursion.
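
For reference, here's a minimal, untested sketch of the same idea using LWP::UserAgent and HTML::LinkExtor (both part of the libwww-perl distribution on the CPAN); resolving relative hrefs with URI->new_abs is my own addition, and the usage line is just an example:

    #!/usr/bin/perl
    # Fetch one page and print the absolute form of every <a href="..."> on it.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $url = shift @ARGV or die "usage: $0 <url>\n";

    my $ua   = LWP::UserAgent->new;
    my $resp = $ua->get($url);
    die "GET $url failed: ", $resp->status_line, "\n" unless $resp->is_success;

    my @links;
    my $parser = HTML::LinkExtor->new( sub {
        my ( $tag, %attr ) = @_;
        push @links, $attr{href} if $tag eq 'a' && defined $attr{href};
    } );
    $parser->parse( $resp->decoded_content );

    # Resolve relative links against the page's base URL before printing.
    print URI->new_abs( $_, $resp->base ), "\n" for @links;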

HTH,

planetscape

Re^2: crawling one website
by vit (Friar) on May 28, 2011 at 14:39 UTC
    Thanks!
    I tried hgrepurl.pl with and without parameters, but it does not print anything.
    Could you give me a usage example without a proxy?
      Actually I got:
      perl -W hgrepurl.pl http://www.senopt.com
      Subroutine Cwd::fastcwd redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
      Subroutine Cwd::getcwd redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
      Subroutine Cwd::abs_path redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
      main::get_html() called too early to check prototype at hgrepurl.pl line 27.
      Is it serious? What should I do?
      I also checked:
      perl -c hgrepurl.pl http://www.senopt.com
      hgrepurl.pl syntax OK

        Viewing the source of http://www.senopt.com shows "links" like:

        <a href="senopt/VPS/vps.html" target="_blank">vps</a> <br> <a href="senopt/DogTraining/dogtraining.html" target="_blank">dog trai +ning</a>

        You're going to have to do a bit more work, I'm afraid. You may want to start by reading More robust link finding than HTML::LinkExtor/HTML::Parser?.
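
        As a rough, untested sketch, WWW::Mechanize (also on the CPAN, not mentioned above) will resolve those relative hrefs against the page URL for you:

            use strict;
            use warnings;
            use WWW::Mechanize;

            my $mech = WWW::Mechanize->new( autocheck => 1 );
            $mech->get('http://www.senopt.com');

            for my $link ( $mech->links ) {
                # url_abs() returns the absolute URL; text() the link text.
                printf "%s (%s)\n", $link->url_abs, $link->text || '';
            }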

        HTH,

        planetscape