Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a web crawler application to index all the pages on our site. The application will generate meta keywords index and page urls in the database so the search script will search the keywords and display the pages. The static pages pose no problems. But some of the dynamic pages under the docs directory, which have URLs of the form http://hostname/path/filename.jsp/id=number, need the "id" in order to be displayed. How can I index those pages?

Thanks

Replies are listed 'Best First'.
Re: How to index dynamic pages?
by dda (Friar) on Aug 09, 2002 at 15:04 UTC
    You can create a dummy page with links and let your crawler read it, for example:
    http://hostname/path/filename.jsp?id=1
    http://hostname/path/filename.jsp?id=2
    http://hostname/path/filename.jsp?id=3
    http://hostname/path/filename.jsp?id=4
    ...
    Or you can put your links into your config file. Look at this search engine for examples.
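
    A minimal sketch of the dummy-page idea, assuming the ids are known in advance. The base URL and the id range 1 .. 4 are placeholders taken from the example above, and seed_links and seed_page are hypothetical helper names:

```perl
use strict;
use warnings;

# Placeholder base URL from the example above; adjust to the real site.
my $base = 'http://hostname/path/filename.jsp';

# Build one crawlable link per id.
sub seed_links {
    my ($base_url, @ids) = @_;
    return map { "$base_url?id=$_" } @ids;
}

# Wrap the links in a minimal HTML page that the crawler can start from.
# Note: no HTML escaping is done, so this assumes URLs without "&" or quotes.
sub seed_page {
    my @links = @_;
    my $items = join "\n", map { qq{<li><a href="$_">$_</a></li>} } @links;
    return "<html><body><ul>\n$items\n</ul></body></html>\n";
}

print seed_page(seed_links($base, 1 .. 4));
```

    Redirect the output to a file on the site and point the crawler at that file as its starting URL. If the ids live in a database, the list passed to seed_links could come from a query instead of a fixed range.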

    --dda

      The crawler actually reads the .jsp page, grabs the keywords and the URL http://hostname/path/filename.jsp without the "id", and inserts them in the database. When the keywords are searched, the page file.jsp cannot be displayed without the "id". How can I solve that?

      Thanks

        Many crawlers intentionally sidestep URLs that look like they're dynamic (i.e., URLs that contain ? = &). To trick crawlers like this, you need to use URLs of the form http://hostname/path/filename.jsp/N where N is an alternative for id=N.
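
        As a rough illustration of that URL form, here is a sketch that rewrites a query-style link into the path-style one. It assumes "id" is the only query parameter, and to_path_style is a hypothetical name:

```perl
use strict;
use warnings;

# Turn ".../filename.jsp?id=N" into ".../filename.jsp/N".
# Assumes "id" is the only query parameter; other URLs are returned unchanged.
sub to_path_style {
    my ($url) = @_;
    $url =~ s{\?id=(\d+)\z}{/$1};
    return $url;
}

print to_path_style('http://hostname/path/filename.jsp?id=42'), "\n";
```

        The page itself then has to understand the /N suffix, so this only helps if the server side is changed to match.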

        If you were using Perl rather than JSP, it would be a simple matter to pick up the /N from $ENV{PATH_INFO} or $ENV{REQUEST_URI}.
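
        For the Perl case, a minimal CGI-style sketch might look like this. It assumes the web server sets PATH_INFO to the part of the URL after the script name (for example "/42"), and id_from_path_info is a hypothetical helper:

```perl
use strict;
use warnings;

# Extract a numeric id from extra path information such as "/42".
# Returns undef when the path info is missing or is not a plain number.
sub id_from_path_info {
    my ($path_info) = @_;
    return undef unless defined $path_info;
    return $path_info =~ m{\A/(\d+)/?\z} ? $1 : undef;
}

my $id = id_from_path_info($ENV{PATH_INFO});
print "Content-Type: text/plain\n\n";
print defined $id ? "Requested id: $id\n" : "No id given\n";
```

        Validating the value as digits before using it keeps arbitrary path text out of whatever lookup the id feeds into.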

        But this isn't JavaMonks, so you're on your own from here.

        >grabs the keywords and url http://hostname/path/filename.jsp without "id" and inserts them in the database

        You wrote that crawler, so why does it do such a weird thing? :) Why can't it insert 'id=N' as well?
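
        A sketch of what that could look like on the crawler side, assuming the crawler currently cuts the URL down before storing it. The index_key name and the in-memory %index hash (standing in for the database table) are hypothetical:

```perl
use strict;
use warnings;

# Return the URL to store in the index: drop only the "#fragment",
# and keep the query string (or extra path) that identifies the page.
sub index_key {
    my ($url) = @_;
    $url =~ s/#.*\z//s;
    return $url;
}

# Hypothetical in-memory stand-in for the keywords-per-URL database table.
my %index;
for my $url ('http://hostname/path/filename.jsp?id=1#top',
             'http://hostname/path/filename.jsp?id=2') {
    push @{ $index{ index_key($url) } }, 'placeholder', 'keywords';
}
print "$_\n" for sort keys %index;
```

        With the full URL as the key, each id gets its own row, and the search script can link straight to a page that displays.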

        --dda

•Re: How to index dynamic pages?
by merlyn (Sage) on Aug 09, 2002 at 21:20 UTC
    >I am writing a web crawler application to index all the pages on our site. The application will generate meta keywords index and page urls in the database so the search script will search the keywords and display the pages.
    You mean like this one?

    First hint of code reuse: search the CPAN. Second hint: search my site.

    -- Randal L. Schwartz, Perl hacker

      Static pages are fine.
      I have to index JSP pages, which can only be displayed with http://hostname/path/filename.jsp/id=N, using the metadata index.
      Any suggestion?