in reply to Re: crawling one website
in thread crawling one website

Thanks!
I tried hgrepurl.pl with and without parameters, but it does not print anything.
Could you give me a usage example without a proxy?

Replies are listed 'Best First'.
Re^3: crawling one website
by vit (Friar) on May 28, 2011 at 14:50 UTC
    Actually I got:
    perl -W hgrepurl.pl http://www.senopt.com
    Subroutine Cwd::fastcwd redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
    Subroutine Cwd::getcwd redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
    Subroutine Cwd::abs_path redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
    main::get_html() called too early to check prototype at hgrepurl.pl line 27.
    Is it serious? What should I do?
    I also checked:
    perl -c hgrepurl.pl http://www.senopt.com
    hgrepurl.pl syntax OK

      Viewing the source of http://www.senopt.com shows "links" like:

      <a href="senopt/VPS/vps.html" target="_blank">vps</a> <br>
      <a href="senopt/DogTraining/dogtraining.html" target="_blank">dog training</a>

      You're going to have to do a bit more work, I'm afraid. You may want to start by reading More robust link finding than HTML::LinkExtor/HTML::Parser?.
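      One way to handle such relative hrefs is to resolve them against the page's base URL before printing them. A minimal sketch (assuming the LWP::UserAgent, HTML::LinkExtor, and URI modules from CPAN are installed; the starting URL is just an example):

      ```perl
      #!/usr/bin/perl
      use strict;
      use warnings;
      use LWP::UserAgent;
      use HTML::LinkExtor;
      use URI;

      my $base = shift || 'http://www.senopt.com/';
      my $ua   = LWP::UserAgent->new;
      my $resp = $ua->get($base);
      die "GET $base failed: ", $resp->status_line unless $resp->is_success;

      my @links;
      my $parser = HTML::LinkExtor->new(sub {
          my ($tag, %attr) = @_;
          return unless $tag eq 'a' && $attr{href};
          # Resolve relative hrefs like "senopt/VPS/vps.html"
          # against the base URL of the response
          push @links, URI->new_abs($attr{href}, $resp->base)->as_string;
      });
      $parser->parse($resp->decoded_content);

      print "$_\n" for @links;
      ```

      The key call is URI->new_abs, which turns "senopt/VPS/vps.html" into a full http://... URL; without it you only see the relative paths.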

      HTH,

      planetscape
        Are you saying that senopt.com does not have real links?

        I tried
        perl hgrepurl.pl http://www.txtlinks.com/
        which is quite encouraging.
        txtlinks is a web directory, and it first returns a portion of URL-encoded stuff, like:
        ...http://www.txtlinks.com/%3Chtml%3E%0A%3Chead%3E%0A%20%20%3Ctitle%3E +TXT%20Links%20Pure%20Links%20Di.......
        without line breaks. Then it returns a bunch of normal links from this site.
        What is the first portion?
        Is it possible to get all real links residing on a site, at all depth levels, with this program?
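
        For crawling a whole site rather than a single page, one possible approach is a small queue-based loop restricted to the starting host. A sketch only (assuming WWW::Mechanize and URI are installed; hgrepurl.pl itself appears to extract links only from the pages you name on the command line):

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;
        use WWW::Mechanize;
        use URI;

        my $start = shift || 'http://www.senopt.com/';
        my $host  = URI->new($start)->host;

        my $mech = WWW::Mechanize->new(autocheck => 0);
        my (%seen, @queue);
        @queue = ($start);

        while (my $url = shift @queue) {
            next if $seen{$url}++;
            my $resp = $mech->get($url);
            next unless $resp->is_success && $mech->is_html;
            print "$url\n";
            for my $link ($mech->links) {
                my $abs = URI->new_abs($link->url, $mech->base);
                $abs->fragment(undef);    # drop #anchors so we don't revisit pages
                # stay on the same host so the crawl doesn't wander off-site
                push @queue, $abs->as_string
                    if $abs->scheme =~ /^https?$/ && $abs->host eq $host;
            }
        }
        ```

        Note there is no depth limit or politeness delay here; for anything beyond a quick experiment you would want both, or a ready-made spider such as WWW::Mechanize's relatives on CPAN.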