in reply to crawling one website

hgrepurl.pl, from the book Web Client Programming with Perl (full text freely available from the O'Reilly Open Books Project), can get you your links. w3mir, from the CPAN, can fetch a list of links and lets you control the depth of recursion.
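
For reference, here's a minimal, untested sketch of the same idea using LWP::UserAgent and HTML::LinkExtor (both part of the libwww-perl distribution on the CPAN); resolving relative hrefs with URI->new_abs is my own addition, and the usage line is just an example:

    #!/usr/bin/perl
    # Fetch one page and print the absolute form of every <a href="..."> on it.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $url = shift @ARGV or die "usage: $0 <url>\n";

    my $ua   = LWP::UserAgent->new;
    my $resp = $ua->get($url);
    die "GET $url failed: ", $resp->status_line, "\n" unless $resp->is_success;

    my @links;
    my $parser = HTML::LinkExtor->new( sub {
        my ( $tag, %attr ) = @_;
        push @links, $attr{href} if $tag eq 'a' && defined $attr{href};
    } );
    $parser->parse( $resp->decoded_content );

    # Resolve relative links against the page's base URL before printing.
    print URI->new_abs( $_, $resp->base ), "\n" for @links;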

HTH,

planetscape

Re^2: crawling one website
by vit (Friar) on May 28, 2011 at 14:39 UTC
    Thanks!
    I tried hgrepurl.pl with and without parameters, but it does not print anything.
    Could you give me a usage example without a proxy?
      Actually I got:
      perl -W hgrepurl.pl http://www.senopt.com
      Subroutine Cwd::fastcwd redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
      Subroutine Cwd::getcwd redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
      Subroutine Cwd::abs_path redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
      main::get_html() called too early to check prototype at hgrepurl.pl line 27.
      Is it serious? What should I do?
      I also checked:
      perl -c hgrepurl.pl http://www.senopt.com
      hgrepurl.pl syntax OK

        Viewing the source of http://www.senopt.com shows "links" like:

        <a href="senopt/VPS/vps.html" target="_blank">vps</a> <br> <a href="senopt/DogTraining/dogtraining.html" target="_blank">dog trai +ning</a>

        You're going to have to do a bit more work, I'm afraid. You may want to start by reading More robust link finding than HTML::LinkExtor/HTML::Parser?.
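
        As a rough, untested sketch, WWW::Mechanize (also on the CPAN, not mentioned above) will resolve those relative hrefs against the page URL for you:

            use strict;
            use warnings;
            use WWW::Mechanize;

            my $mech = WWW::Mechanize->new( autocheck => 1 );
            $mech->get('http://www.senopt.com');

            for my $link ( $mech->links ) {
                # url_abs() returns the absolute URL; text() the link text.
                printf "%s (%s)\n", $link->url_abs, $link->text || '';
            }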

        HTH,

        planetscape