peterr has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

We are having problems with a site where session IDs are being set, despite the fact that the PHP code is meant to exclude the session ID from the links/URLs. This is of course causing some grief to the website owner, as we have seen links to the website on both Yahoo and MSN that contain the session ID. This can potentially cause big problems.

We have reviewed what needs to be done and need to test the new PHP code; however, we do not want to use any of the "Search Engine Simulators" around. I would rather we place a Perl script on the website and do our own isolated testing.

So, does anyone know of a good Perl script that simulates a 'spider crawl', please, just to show the links and related links, so that we can thoroughly test that session IDs are not appearing? If possible, the script would allow us to enter the 'user agent', because we only want to turn sessions off for spiders/bots, etc., not the general public.

The type of 'results' I need is the same as that produced by these simulators:

http://www.1-hit.com/all-in-one/tool.search-engine-viewer.htm

http://www.webconfs.com/search-engine-spider-simulator.php

Thanks,

Peter

Replies are listed 'Best First'.
Re: Search Engine Simulator
by Corion (Patriarch) on Dec 10, 2004 at 08:06 UTC

    On the CPAN, there is WWW::Robot, which will spider your URLs, and it is also not very hard to write a spider using WWW::Mechanize. You will need to set the proper User-Agent header in both cases, so your spider (mis)identifies itself as Google, MSN or whatever.
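    A rough, untested sketch of such a Mechanize-based spider (the start URL, the spoofed agent string, the page limit and the PHPSESSID pattern are all placeholders to adapt):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use WWW::Mechanize;
        use URI;

        # Placeholders: point these at the real site, agent string and limit.
        my $start_url = 'http://www.example.com/';
        my $agent     = 'Googlebot/2.1 (+http://www.google.com/bot.html)';
        my $max_pages = 50;

        my $mech = WWW::Mechanize->new( agent => $agent, autocheck => 0 );
        my $host = URI->new($start_url)->host;

        my %seen;
        my @queue = ($start_url);

        while ( @queue and keys %seen < $max_pages ) {
            my $url = shift @queue;
            next if $seen{$url}++;

            my $res = $mech->get($url);
            next unless $res->is_success and $mech->is_html;
            print "Crawled: $url\n";

            for my $link ( $mech->links ) {
                my $uri = $link->url_abs;
                next unless $uri->scheme and $uri->scheme =~ /^https?$/;
                next unless $uri->host eq $host;    # stay on the same site

                my $abs = $uri->as_string;
                print "  SESSION ID LEAKED: $abs\n" if $abs =~ /PHPSESSID=/i;
                push @queue, $abs unless $seen{$abs};
            }
        }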

      Hi,

      Thanks, both of those modules look promising, especially the second one, where we can set the user agent, because we need to test quite a number of agent names to make sure the session IDs get turned off for them.

      Peter

        I expect any such module to be based on the LWP library and thus you should be able to set the user agent in any of those libraries with equal ease. But I haven't worked with WWW::Robot, so I don't know if it actually uses LWP (it should).
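
        For example, with plain LWP (the agent string and URL below are only example values):

            use LWP::UserAgent;

            my $ua = LWP::UserAgent->new;
            $ua->agent('msnbot/1.0 (+http://search.msn.com/msnbot.htm)');   # pretend to be MSNbot
            my $res = $ua->get('http://www.example.com/');
            print $res->content if $res->is_success;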

Re: Search Engine Simulator
by BUU (Prior) on Dec 10, 2004 at 12:42 UTC
    Ignoring the technical question, the social issue sounds impossible. Your problem is that two or more people end up sharing one session ID, right? Well, if the session ID is in the URL, you're going to have this happen a lot, not just from search engine links.

    For one thing, you're never going to have an exact list of all the search engines. Another problem is people copying the URL to each other via IM, message boards or those public bookmark lists. Those are all going to include the session ID in the URL. There's really no good way to store the session ID in the URL. If you want it to be restricted to one per person, you should use cookies.
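
    A minimal CGI.pm sketch of that cookie approach (the cookie name, expiry and ID generation are arbitrary example choices):

        use strict;
        use warnings;
        use CGI;
        use Digest::MD5 qw(md5_hex);

        my $q = CGI->new;

        # Reuse the visitor's cookie if present, otherwise mint a new ID.
        my $id = $q->cookie('session_id') || md5_hex( time . $$ . rand );

        # The ID travels in a cookie, so it never shows up in the page's URLs.
        print $q->header(
            -cookie => $q->cookie( -name => 'session_id', -value => $id, -expires => '+1h' )
        );
        print "Your session: $id\n";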
Re: Search Engine Simulator
by rupesh (Hermit) on Dec 10, 2004 at 04:44 UTC

    I actually didn't know what a "spider crawl" meant.
    So, I asked the oracle and it told me!
    Perhaps it could help you, too...


    Cheers,
    Rupesh.
Re: Search Engine Simulator
by talexb (Chancellor) on Dec 10, 2004 at 19:54 UTC

    Can't the code that handles the incoming session ID decide whether it's a 'current' session, and if not, assign a new session ID when it re-writes the URL? That would get around the problem that (I think) you've described.
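
    Something like this, perhaps (just a sketch; the in-memory %sessions hash and the expiry time stand in for whatever session store and policy the real site uses):

        use strict;
        use warnings;
        use CGI;
        use Digest::MD5 qw(md5_hex);

        # Hypothetical session store; the real site would use a DB or files.
        my %sessions = ( 'abc123' => { expires => time + 3600 } );

        my $q  = CGI->new;
        my $in = $q->param('PHPSESSID');    # the ID arriving on the URL

        my $id;
        if ( defined $in and $sessions{$in} and $sessions{$in}{expires} > time ) {
            $id = $in;                                # still a live session: keep it
        }
        else {
            $id = md5_hex( time . $$ . rand );        # stale or unknown: issue a fresh one
            $sessions{$id} = { expires => time + 3600 };
        }

        # Outgoing links would then be rewritten with $id, not the incoming value.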

    Or maybe I've missed something.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds