Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi:

I want to remove session IDs - but only when a bot (user-agent: googlebot, slurp, etc.) is crawling my site (apache + mod_perl).

The session IDs are not generated on the initial page requested (whether index or interior page); but are generated in subsequent links from that requested page.

Example:

page requested:
www.example.org/ (and will appear as such to bots)

links on requested page appear as:
www.example.org/dir1/?sessionID=aa162314bDea53872123
www.example.org/dir2/?sessionID=aa162314bDea53872123
www.example.org/dir3/?sessionID=aa162314bDea53872123
and so on....

Any suggestions, references, or examples of code that has worked for anyone else? I am new at this so simpler is better.

Thanks.

Replies are listed 'Best First'.
Re: Need to remove session IDs
by wolfi (Scribe) on Apr 10, 2004 at 03:16 UTC
    I may be totally under-thinking this, but here's a brief and crude thought: just bypass your normal routines if it's a bot (I'm assuming you probably want to do less work for them anyway) and split the script into two sections: bots and not-bots.
    if ( $ENV{'HTTP_USER_AGENT'} =~ /googlebot\/2\.1|slurp|some_other_bot_name/i ) {
        # put any other routines you want to run on the request
        # -> like logs, etc. here, or just...
        print "Location: http://www.example.org$ENV{'PATH_INFO'}\n\n";
    }
    # And if not in your bot-list, do what you originally planned...
    else {
        # your original script's body here
    }

    It would probably be easier to keep the bot names in an array and build the pattern from a variable than to pile too many | alternations into that regex, but I'm being lazy and non-thinking at the moment.

    One word of caution before using something like this: any and all %ENV values need to be cleaned up before use; you need to ensure they contain no hostile characters. For the directories you have there, something like this would do: $ENV{'PATH_INFO'} =~ /^([a-zA-Z_0-9]+\/?)*$/

    One can't rely on the environment variables too much, but in this case it would probably work.
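    As a concrete version of that check, here is a small untainting helper in the spirit of the pattern above; the function name is mine, not from the thread, and this is a sketch rather than a complete taint-mode solution:

```perl
# Accept PATH_INFO only if it consists of word characters and
# slashes; return undef for anything else, so suspicious paths
# (e.g. containing "..") never reach the redirect.
# (Helper name is illustrative.)
sub clean_path_info {
    my ($path) = @_;
    return $path =~ m{^([\w/]*)$} ? $1 : undef;
}
```

    Capturing and returning $1 (rather than the original string) is what actually untaints the value when Perl runs with -T.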

Re: Need to remove session IDs
by nothingmuch (Priest) on Apr 10, 2004 at 00:52 UTC
    You could write an output filter using Apache::Filter and strip the session IDs from the outgoing HTML there.

    But isn't it better to fix this at the level where the IDs are generated? Think of the impact a filter like that would have if it had a slight error, or if somebody changed the underlying code without updating the filter.

    I would look at simply denying login/session keeping to bot user agents, or using crafty robots.txt files to simply keep them out of wherever you don't want them.
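    For the robots.txt route, a minimal example (the directory names are taken from the question and are only illustrative) that keeps well-behaved crawlers out of the session-generating areas:

```
User-agent: *
Disallow: /dir1/
Disallow: /dir2/
Disallow: /dir3/
```

    Note this only works for crawlers that honor robots.txt; it does nothing against misbehaving bots.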

    Good luck!

    -nuffin
    zz zZ Z Z #!perl
Re: Need to remove session IDs
by Fletch (Bishop) on Apr 10, 2004 at 03:18 UTC

    Another possibility might be to have a PerlTransHandler check the user agent on incoming requests and munge the session ID out of the URL before it gets processed further down the request chain.
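    A minimal sketch of that idea in a mod_perl 1.x style (the bot pattern and sessionID parameter name follow the examples above; this is an untested outline, not drop-in code):

```perl
# Sketch of a PerlTransHandler that strips the sessionID query
# parameter for known crawlers before the request goes further
# down the chain.
use strict;
use warnings;

# Under mod_perl this constant would come from Apache::Constants;
# it is defined here so the sketch stands alone.
use constant DECLINED => -1;

# Remove the sessionID pair from a query string. A pure function,
# so it is easy to test outside Apache.
sub strip_session_id {
    my ($args) = @_;
    return '' unless defined $args;
    $args =~ s/&?sessionID=[^&]*//;
    $args =~ s/^&//;    # tidy a leading '&' left by the removal
    return $args;
}

# The handler itself: rewrites only requests from crawlers, then
# declines so the rest of the request chain runs normally.
sub handler {
    my $r  = shift;
    my $ua = $r->header_in('User-Agent') || '';
    return DECLINED unless $ua =~ /googlebot|slurp/i;
    $r->args( strip_session_id( $r->args ) );
    return DECLINED;
}
```

    The handler would be wired up with a PerlTransHandler directive in httpd.conf; only the query-string stripping itself is shown runnable here.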

Re: Need to remove session IDs
by Hissingsid (Sexton) on Apr 10, 2004 at 07:26 UTC
    Hi,

    IMHO what you are proposing could possibly be misinterpreted as cloaking, and if detected it may have the opposite effect to the one you desire. However, since the session IDs are generated by your own scripts, I think you should be OK.

    To be on the safe side, particularly if you decide to use a solution outside of your scripts, check what headers the server will be sending back to robots; ideally you want 200 OK.

    Best wishes

    Sid