in reply to Re^2: crawling one website
in thread crawling one website

Actually I got:
perl -W hgrepurl.pl http://www.senopt.com
Subroutine Cwd::fastcwd redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
Subroutine Cwd::getcwd redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
Subroutine Cwd::abs_path redefined at c:/ActivePerl/site/lib/Cwd.pm line 812.
main::get_html() called too early to check prototype at hgrepurl.pl line 27.
Is it serious? What should I do?
I also checked:
perl -c hgrepurl.pl http://www.senopt.com
hgrepurl.pl syntax OK
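For what it's worth, the "called too early to check prototype" warning means the script calls get_html() before Perl has compiled its definition, so the prototype cannot be checked at the call site. The documented fix is a forward declaration. A minimal sketch of the pattern (the body of get_html here is made up for illustration, not taken from hgrepurl.pl):

```perl
use strict;
use warnings;

# Forward-declare the sub so its prototype is known before the call;
# without this line, Perl warns "called too early to check prototype".
sub get_html ($);

print get_html("http://www.example.com/"), "\n";

# The definition appears after the call site, as in hgrepurl.pl.
sub get_html ($) {
    my ($url) = @_;
    return "fetching $url";    # placeholder body, for illustration only
}
```

Moving the sub definition above the first call works just as well.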

Re^4: crawling one website
by planetscape (Chancellor) on May 28, 2011 at 18:05 UTC

    Viewing the source of http://www.senopt.com shows "links" like:

    <a href="senopt/VPS/vps.html" target="_blank">vps</a> <br>
    <a href="senopt/DogTraining/dogtraining.html" target="_blank">dog training</a>

    You're going to have to do a bit more work, I'm afraid. You may want to start by reading More robust link finding than HTML::LinkExtor/HTML::Parser?.
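    Part of that extra work is that these hrefs are relative, so a crawler has to resolve them against the page's base URL before fetching. A minimal sketch using the URI module (the base URL is the site from this thread):

```perl
use strict;
use warnings;
use URI;

# Resolve a relative href against the page it was found on.
my $base = "http://www.senopt.com/";
my $abs  = URI->new_abs("senopt/VPS/vps.html", $base);
print "$abs\n";    # prints http://www.senopt.com/senopt/VPS/vps.html
```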

    HTH,

    planetscape
      Are you saying that senopt.com does not have real links?

      I tried:
      perl hgrepurl.pl http://www.txtlinks.com/
      and the output is quite encouraging.
      txtlinks is a web directory, and the script first returns a portion of URL-encoded output, like:
      ...http://www.txtlinks.com/%3Chtml%3E%0A%3Chead%3E%0A%20%20%3Ctitle%3ETXT%20Links%20Pure%20Links%20Di.......
      without line breaks. Then it returns a bunch of normal links from the site.
      What is that first portion?
      Is it possible to get all the real links on a site, at every depth level, with this program?
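      (For what it's worth, that first portion looks like percent-encoded HTML rather than a link; decoding it makes that visible. A minimal sketch using URI::Escape, with the string shortened from the output above:)

```perl
use strict;
use warnings;
use URI::Escape qw(uri_unescape);

# The leading "URL-encoded stuff" is just percent-encoded HTML.
my $enc = "%3Chtml%3E%0A%3Chead%3E%0A%20%20%3Ctitle%3ETXT%20Links";
print uri_unescape($enc), "\n";    # decodes to "<html>", "<head>", "  <title>TXT Links" on separate lines
```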
        Are you saying that senopt.com does not have real links?
        ...
        Is it possible to get all the real links on a site, at every depth level, with this program?

        How do you define "real links"?

        What did your reading of More robust link finding than HTML::LinkExtor/HTML::Parser? suggest?
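        If you do want every link at every depth, the usual approach is to put extracted links on a queue and keep fetching. A minimal same-host breadth-first sketch, assuming LWP::UserAgent and HTML::LinkExtor are installed (the start URL is illustrative, and a real crawler should also honor robots.txt and rate-limit itself):

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $start = URI->new("http://www.example.com/");    # illustrative start page
my $ua    = LWP::UserAgent->new(timeout => 10);

my %seen;
my @queue = ($start);
while (my $url = shift @queue) {
    next if $seen{$url}++;                          # skip pages already fetched
    my $res = $ua->get($url);
    next unless $res->is_success and $res->content_type eq 'text/html';

    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' and defined $attr{href};
        my $abs = URI->new_abs($attr{href}, $url);  # relative -> absolute
        push @queue, $abs                           # stay on the starting host
            if $abs->scheme =~ /^https?$/ and $abs->host eq $start->host;
    });
    $parser->parse($res->decoded_content);
    print "$url\n";
}
```

        The %seen hash is what keeps the crawl from looping forever on sites that link back to themselves.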

        HTH,

        planetscape