Baz has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone,
I'm trying to simulate a search using the search engine here
First of all, I have to get the engine and session ids - which is done by requesting this page, and then searching the source where these ids are embedded.
Second, I search for all Griffin names in the BT postcode area. But during this attempt I get -

Your session has been automatically terminated after 30 minutes of inactivity on bt.com. You can now continue your visit on the site.

So I guess I'm doing the cookie part wrong, or perhaps I need to transfer some environment variables.
Any ideas?
Barry.

You can view the output of the perl script below here
    #!/usr/bin/perl -w
    #use strict;
    use URI;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Headers;
    use HTTP::Response;
    use HTTP::Cookies;
    use HTTP::Request::Common qw(GET POST);

    my $cookie_file = "cookies.txt";
    my $cookie_jar = HTTP::Cookies->new( file => $cookie_file, autosave => 1);
    my $url_home = "http://www.bt.co.uk/directory-enquiries/dq_home.jsp";
    my $url_search = "http://www.bt.co.uk/directory-enquiries/dq_locationfinder.jsp";
    my $ua = new LWP::UserAgent();

    # Get a session ID first
    my $req = GET $url_home;
    my $res1 = $ua->request($req);
    die $res1->as_string() . "\n" if $res1->is_error();
    die "Can't find a session ID!\n" unless ($res1->as_string() =~ /BV_SessionID=([^&]+)\&/);
    my $sessID = $1;
    die "Can't find an engine ID!\n" unless ($res1->as_string() =~ /BV_EngineID=([^&]+)\&/);
    my $engID = $1;
    #die "Can't find a Cookie ID!\n" unless ($res1->as_string() =~ /BV_IDS=([^;]+)\;/);
    #my $cookie = $1;
    #print STDERR "Got session ID $sessID\n";
    #print STDERR "Got engine ID $engID\n";
    #print STDERR "Got Cookie $cookie\n";
    $cookie_jar->extract_cookies($res1);

    # Save the cookie jar's state
    print "Cookies: ",$cookie_jar->as_string(),"\n";
    $cookie_jar->save($cookie_file);

    ###################### Start Searching
    # too lazy for urlencode...
    $sessID =~ s/\@/%40/g;
    my $request = POST $url_search, [
        QRY => 'res',
        BV_SessionID => $sessID,
        BV_EngineID => $engID,
        new_search => 'true',
        NAM => 'Griffin',
        GIV => '',
        LOC => '',
        STR => '',
        PCD => 'BT',
        limit => '50',
        CallingPage => 'Homepage',
    ];
    $cookie_jar->load;
    $cookie_jar->add_cookie_header($request);
    my $res2 = $ua->request($request);

    ###################### How many BT** on this page
    my $pageCount = 0;
    if( $res2->content =~ /(\d+) of (\d+)/) {
        print $1," of ",$2,"\n";
        $pageCount = $2;
    }
    my %count = ();
    my $content = $res2->content;
    while($content =~ /pcd\=BT(\d+)/g) {
        $count{$1}++;
    }
    foreach my $keys(sort keys %count) {
        print $keys,": ",$count{$keys},"\n";
    }

    ###################### Reveal Second Page Results
    die $res2->as_string() . "\n" if $res2->is_error();
    die "Can't find a session ID!\n" unless ($res2->as_string() =~ /BV_SessionID=([^&]+)\&/);
    $sessID = $1;
    die "Can't find an engine ID!\n" unless ($res2->as_string() =~ /BV_EngineID=([^&]+)\&/);
    $engID = $1;
    print STDERR "Got session ID $sessID\n";
    print STDERR "Got engine ID $engID\n";
    # too lazy for urlencode...
    $sessID =~ s/\@/%40/g;
    $request = POST $url_search, [
        QRY => 'res',
        # BV_SessionID => $sessID,
        # BV_EngineID => $engID,
        NAM => 'Griffin',
        lci => '0',
        PCD => 'BT',
        start_id => '50',
        CallingPage => 'Homepage',
    ];
    my $res3 = $ua->request($request);

    ################### Print 3 pages to http://baz.perlmonk.org/save.html
    open (LOG,">save.html");
    my $fileOut = $res1->as_string().$res2->as_string().$res3->as_string();
    print LOG "$fileOut";

Replies are listed 'Best First'.
Re: Requesting webpages which use cookies and session ids. (rev)
by dws (Chancellor) on Aug 04, 2002 at 19:42 UTC

    This script attacks a similar problem. One difference between my script and yours is   $ua->cookie_jar($cookie_jar); My understanding is that unless you associate the cookie jar directly with the user agent, you won't hang on to session cookies, and hence won't be able to send them on subsequent POST requests.
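For reference, here is a minimal offline sketch of what attaching the jar to the user agent does for you. Once `$ua->cookie_jar($jar)` is set, LWP calls `extract_cookies` on each response and `add_cookie_header` on each outgoing request by itself. The snippet makes those two calls by hand on a hand-built response, so it runs without touching the network. The cookie name `BV_IDS` comes from the commented-out regex in the original post; the value `example-value` and the `path=/` attribute are made up for illustration, and whether the real site depends on this cookie at all is not established here.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;
use HTTP::Response;

# In-memory jar; no file is read or written in this sketch.
my $jar = HTTP::Cookies->new;

# Simulated first request/response pair (cookie value is hypothetical).
my $first_req = HTTP::Request->new(
    GET => 'http://www.bt.co.uk/directory-enquiries/dq_home.jsp');
my $first_res = HTTP::Response->new(200, 'OK');
$first_res->header('Set-Cookie' => 'BV_IDS=example-value; path=/');
$first_res->request($first_req);   # the jar needs to know which URL set the cookie

# What LWP does after each response when a jar is attached.
$jar->extract_cookies($first_res);

# What LWP does before each request when a jar is attached.
my $second_req = HTTP::Request->new(
    POST => 'http://www.bt.co.uk/directory-enquiries/dq_locationfinder.jsp');
$jar->add_cookie_header($second_req);

my $cookie_header = $second_req->header('Cookie') // '';
print "Cookie header on second request: $cookie_header\n";
```

In a real script the two manual calls go away: a single `$ua->cookie_jar($jar);` before the first request replaces them.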

      Thanks dws - but it's still throwing up the same problem as outlined above.

      What might I be doing wrong? Is there any reason why the cookies wouldn't be working? And if not, is it environment variables that I need to look at?
      Thanks.

      I've edited the code anyway and here it is -
      #!/usr/bin/perl -w
      use strict;
      use URI;
      use LWP::UserAgent;
      use HTTP::Request;
      use HTTP::Headers;
      use HTTP::Response;
      use HTTP::Cookies;
      use HTTP::Request::Common qw(GET POST);

      my $url_home = "http://www.bt.co.uk/directory-enquiries/dq_home.jsp";
      my $ua = new LWP::UserAgent();
      $ua->cookie_jar(HTTP::Cookies->new(file => "lwpcookies.txt",autosave => 1));

      # Get a session ID first
      my $req = GET $url_home;
      my $res1 = $ua->request($req);
      die $res1->as_string() . "\n" if $res1->is_error();
      die "Can't find a session ID!\n" unless ($res1->as_string() =~ /BV_SessionID=([^&]+)\&/);
      my $sessID = $1;
      die "Can't find an engine ID!\n" unless ($res1->as_string() =~ /BV_EngineID=([^&]+)\&/);
      my $engID = $1;
      #die "Can't find a Cookie ID!\n" unless ($res1->as_string() =~ /BV_IDS=([^;]+)\;/);
      $cookie_jar->extract_cookies($res1);

      # Save the cookie jar's state
      print "Cookies: ",$cookie_jar->as_string(),"\n";
      $cookie_jar->save($cookie_file);

      ###################### Start Searching
      # too lazy for urlencode...
      $sessID =~ s/\@/%40/g;
      my $request = POST $url_search, [
          QRY => 'res',
          BV_SessionID => $sessID,
          BV_EngineID => $engID,
          new_search => 'true',
          NAM => 'Griffin',
          GIV => '',
          LOC => '',
          STR => '',
          PCD => 'BT',
          limit => '50',
          CallingPage => 'Homepage',
      ];
      $cookie_jar->load;
      $cookie_jar->add_cookie_header($request);
      my $res2 = $ua->request($request);

      ###################### How many BT** on this page
      # deleted

      ###################### Reveal Second Page Results
      die $res2->as_string() . "\n" if $res2->is_error();
      die "Can't find a session ID!\n" unless ($res2->as_string() =~ /BV_SessionID=([^&]+)\&/);
      $sessID = $1;
      die "Can't find an engine ID!\n" unless ($res2->as_string() =~ /BV_EngineID=([^&]+)\&/);
      $engID = $1;
      print STDERR "Got session ID $sessID\n";
      print STDERR "Got engine ID $engID\n";
      # too lazy for urlencode...
      $sessID =~ s/\@/%40/g;
      $request = POST $url_search, [
          QRY => 'res',
          NAM => 'Griffin',
          lci => '0',
          PCD => 'BT',
          start_id => '50',
          CallingPage => 'Homepage',
      ];
      my $res3 = $ua->request($request);

      ################### Print 3 pages to http://baz.perlmonk.org/save.html
      open (LOG,">save.html");
      my $fileOut = $res1->as_string().$res2->as_string().$res3->as_string();
      print LOG "$fileOut";

        You say you are too lazy for urlencode, but you do it partially anyway. I believe you should not. HTTP::Request::Common::POST() should take care of that. Please try commenting out these lines:

        # too lazy for urlencode...
        $sessID =~ s/\@/%40/g;

          Jenda
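To illustrate the point about encoding, the sketch below builds the same POST twice without sending anything: once with the raw session ID and once with the hand-encoded one. Since `POST` with an array reference form-encodes each value itself, the hand-encoded `%40` gets its `%` escaped again and arrives at the server as `%2540`. The session ID is the sample value quoted elsewhere in this thread; everything else in the real form is left out for brevity.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

my $url    = 'http://www.bt.co.uk/directory-enquiries/dq_locationfinder.jsp';
my $raw_id = '@@@@0472129835.1028555271@@@@';   # sample ID from the thread

# Let POST do the encoding: each '@' becomes '%40' exactly once.
my $good = POST $url, [ BV_SessionID => $raw_id ];

# Encode by hand first, as the original script does, then let POST encode again.
(my $pre_encoded = $raw_id) =~ s/\@/%40/g;
my $bad = POST $url, [ BV_SessionID => $pre_encoded ];

print "raw value passed to POST:     ", $good->content, "\n";
print "hand-encoded value to POST:   ", $bad->content,  "\n";
```

The second body contains `%2540` where the first contains `%40`, so the server would decode it to a literal `%40` rather than `@`.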

Re: Requesting webpages which use cookies and session ids. (rev)
by crenz (Priest) on Aug 04, 2002 at 21:00 UTC

    Baz,

    I guess the session ID and engine ID are connected to the search, so try reusing the old IDs. I think for the whole script, you should only need to get a sessID and engID once -- unless you perform more than 10 searches (not counting looking at the 2nd page of a search etc) or your script pauses for a few minutes between the requests.

    Update: I can query the page using lynx just fine, even with disallowing all cookies. So the answer is not in the cookie jar.
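A rough sketch of the "fetch the IDs once and reuse them" idea, assuming the guess above about a limit of roughly 10 searches per session is right. `make_id_source` and the stub fetcher are hypothetical names invented for this example; in a real script the fetcher would request dq_home.jsp and pull the two IDs out of the page. The time-based expiry mentioned above is not handled here.

```perl
use strict;
use warnings;

# Returns a closure that hands out cached IDs and refetches after $max_uses calls.
sub make_id_source {
    my ($fetch, $max_uses) = @_;
    my ($ids, $uses) = (undef, 0);
    return sub {
        if (!$ids || $uses >= $max_uses) {
            $ids  = $fetch->();   # get a fresh session ID / engine ID pair
            $uses = 0;
        }
        $uses++;
        return $ids;
    };
}

# Stub fetcher so the sketch runs offline; it just counts how often it is called.
my $fetch_count = 0;
my $fake_fetch  = sub {
    $fetch_count++;
    return {
        BV_SessionID => "session-$fetch_count",
        BV_EngineID  => "engine-$fetch_count",
    };
};

my $next_ids = make_id_source($fake_fetch, 10);   # 10 is the guessed limit
my @seen     = map { $next_ids->()->{BV_SessionID} } 1 .. 12;

print "12 searches needed $fetch_count fetches of the home page\n";
```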

      Thanks crenz,
      Yeah, I made the same observation when I rejected cookies using lynx. The thing is, when I view the first search page using your original hack, I get the first page of results but I also get a message (at the start of the page) saying I have been disconnected from the search.
      Also, if you do the search in IE6, you will only see the ids being passed via the query string for the first page of search results. For subsequent pages (when you click NEXT) you won't see any mention of ids in the query string. But for our program, if you include the two ids in the string anyway, you get a message (I can't remember it exactly) about the server being busy (the server isn't busy, but whatever loop you fall out of, you end up getting this message). If you leave the ids out of the search string (as in my code at the moment), you get a message saying that the search utility only works for Netscape and IE. In both cases you get no results when you try to view anything beyond the first page of results; instead you get one of the two aforementioned error messages. Therefore I'm guessing that the ids need to be passed, but IE is perhaps using a different method - I really don't know at this stage. Maybe when you include the ids the second time, the server thinks it's processing the first page again (i.e. it uses the existence of the ids in the search string to establish whether it's the first results page or a subsequent one)... and that's why you get two different sets of errors for subsequent search pages, despite the fact that I would have expected the server to ignore the ids, as the NEXT links don't contain them.
      Just now I've tried repeating the search in IE. When the first page of results displayed, I copied the url from the address bar and removed the engine and session ids. I then opened up Netscape, pasted in the new url and the search worked fine. Therefore I don't think the browsers ever need to receive the ids via the query string. I'm lost, how about you?
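The by-hand experiment above (copy the url, delete the two ids, paste it elsewhere) can be reproduced in a script with the URI module. This is only a sketch: the url is assembled from the parameter names and sample values that appear in this thread, not copied from a live session, and it says nothing about how the server will react.

```perl
use strict;
use warnings;
use URI;

# Illustrative url built from parameters quoted in the thread.
my $url = URI->new(
    'http://www.bt.co.uk/directory-enquiries/dq_home.jsp'
  . '?QRY=res&BV_SessionID=@@@@0472129835.1028555271@@@@'
  . '&BV_EngineID=ccccadcflifjlhkcflgcefkdffndfki.0'
  . '&new_search=true&NAM=Griffin&PCD=BT&limit=50&CallingPage=Homepage'
);

# query_form returns a flat key/value list; keep everything except the two ids.
my @pairs = $url->query_form;
my @kept;
while (my ($name, $value) = splice(@pairs, 0, 2)) {
    next if $name =~ /^BV_(?:SessionID|EngineID)$/;
    push @kept, $name, $value;
}
$url->query_form(@kept);

print "$url\n";
```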
        Sorry Jenda, I missed your post... I commented out that conversion line and now I'm getting

        To use this service you will need either an Internet Explorer (IE) browser or Netscape 4.7 and above.

        for the first search page now as well as for subsequent pages - at least there's some consistency there. :)
Re: Requesting webpages which use cookies and session ids. (rev)
by PodMaster (Abbot) on Aug 05, 2002 at 14:36 UTC
    I got bored, and played with it a little (your code/dilemma). Hopefully you can learn something from the below, as I'm not going to even try to explain (it may be overwhelming, but it's all pretty much self-explanatory). It works for me, as messy with debug info as it is. Why does it work? It's the simplest approach I could think of. Since GET requests always worked when copying the url by hand, that's what I stuck to.
    #!/usr/bin/perl -w
    # /tell baz oy vey, you're abusing as_string in [id://187513], seriously abusing it.
    # /tell baz also, you're parsing html by hand, i don't like that ;)
    use strict;
    use Data::Dumper;
    use HTML::TokeParser;
    use URI;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Headers;
    use HTTP::Response;
    use HTTP::Cookies;
    use HTML::LinkExtor;
    use HTTP::Request::Common qw(GET POST);

    my $WHATWORKS = 'http://www.bt.co.uk/directory-enquiries/dq_home.jsp?QRY=res&BV_SessionID=@@@@0472129835.1028555271@@@@&BV_EngineID=ccccadcflifjlhkcflgcefkdffndfki.0&new_search=true&NAM=A*&GIV=&LOC=London&STR=&PCD=&limit=25&CallingPage=Homepage&Search.x=17&Search.y=13';
    $WHATWORKS = URI->new($WHATWORKS);
    warn Dumper{ $WHATWORKS->query_form};

    my $cookie_file = "cookies.txt";
    my $cookie_jar = HTTP::Cookies->new(
        file => $cookie_file,
        autosave => 1,
        ignore_discard => 1, # IMPORTANT!!!!!!!!!!!!
    );
    my $url_home = "http://www.bt.co.uk/directory-enquiries/dq_home.jsp";
    my $url_search = "http://www.bt.co.uk/directory-enquiries/dq_locationfinder.jsp";
    my $ua = new LWP::UserAgent();
    $ua->agent( "Mozilla/8.0(${^O};retmaspod)" );
    $ua->cookie_jar( $cookie_jar );

    #
    # Get a session ID first
    my $req = GET $url_home;
    my $res = $ua->request( $req );
    print $res->status_line();
    # die Dumper $res; # as you need
    my %FORMOLA;
    ParseIt( \$res->{_content} );

    # cause http://www.bt.co.uk/directory-enquiries/dq_locationfinder.jsp
    # requires javascript, and there is no way in hell i'm going to use it
    # so you gotta do that one on your own Baz, shouldn't be hard
    # considering I show you how, here
    $url_search = $WHATWORKS;
    $req = GET $url_search;
    $WHATWORKS->query_form( BV_SessionID => $FORMOLA{BV_SessionID} );
    $WHATWORKS->query_form( BV_EngineID => $FORMOLA{BV_EngineID} );
    warn $FORMOLA{BV_EngineID} ;
    warn $FORMOLA{BV_SessionID} ;
    warn Dumper{ $WHATWORKS->query_form};
    $res = $ua->request($req);
    print $res->content();
    my $p = new HTML::LinkExtor(undef,$url_search);
    $p->parse( $res->{_content} );
    print Dumper $p->links;
    die Dumper $res;

    sub ParseIt {
        my $p = new HTML::TokeParser( $_[0] );
        while(my $t = $p->get_token() ) {
            # ["S", $tag, $attr, $attrseq, $text]
            # ["E", $tag, $text]
            # ["T", $text, $is_data]
            # ["C", $text]
            # ["D", $text]
            # ["PI", $token0, $text]
            # print Dumper $$t[2]
            $FORMOLA{ $$t[2]->{name} } = $$t[2]->{value}
                if $$t[0] eq 'S'
                and $$t[1] eq 'input'
                and $$t[2]->{type} eq 'hidden';
        }
    }
    __END__
    stuff I noticed/got from the first page

    document.dqform.CallingPage.value="locationfinder";
    document.dqform.action="/directory-enquiries/dq_locationfinder.jsp";
    document.dqform.submit();}

    <input type=hidden name="QRY" value="res">
    <input type=hidden name="BV_SessionID" value="@@@@1590200227.1028551469@@@@">
    <input type=hidden name="BV_EngineID" value="cccjadcflifjlhlcflgcefkdffndfkh.0">

    E:\dev>get -x -U -s -S -e "http://www.bt.co.uk/directory-enquiries/dq_home.jsp?QRY=res&BV_SessionID=@@@@0472129835.1028555271@@@@&BV_EngineID=ccccadcflifjlhkcflgcefkdffndfki.0&new_search=true&NAM=A*&GIV=&LOC=London&STR=&PCD=&limit=25&CallingPage=Homepage&Search.x=17&Search.y=13">g.html
    LWP::UserAgent::new: ()
    LWP::UserAgent::request: ()
    LWP::UserAgent::send_request: GET http://www.bt.co.uk/directory-enquiries/dq_home.jsp?QRY=res&BV_SessionID=@@@@0472129835.1028555271@@@@&BV_EngineID=ccccadcflifjlhkcflgcefkdffndfki.0&new_search=true&NAM=A*&GIV=&LOC=London&STR=&PCD=&limit=25&CallingPage=Homepage&Search.x=17&Search.y=13
    LWP::UserAgent::_need_proxy: Not proxied
    LWP::Protocol::http::request: ()
    LWP::Protocol::collect: read 1360 bytes
    LWP::Protocol::collect: read 976 bytes
    LWP::Protocol::collect: read 384 bytes
    LWP::Protocol::collect: read 1360 bytes
    LWP::Protocol::collect: read 1360 bytes
    LWP::Protocol::collect: read 1360 bytes
    LWP::Protocol::collect: read 1360 bytes
    LWP::Protocol::collect: read 1360 bytes
    LWP::Protocol::collect: read 1360 bytes
    LWP::Protocol::collect: read 1360 bytes
    LWP::Protocol::collect: read 1360 bytes
    LWP::Protocol::collect: read 1360 bytes
LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 1360 bytes LWP::Protocol::collect: read 266 bytes LWP::UserAgent::request: Simple response: OK E:\dev>get -v This is lwp-request version 2.01 (libwww-perl-5.64) Copyright 1995-1999, Gisle Aas. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
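A smaller, offline variation on the same approach, offered as a sketch rather than a replacement for the script above. It reads the hidden form fields with HTML::TokeParser and then swaps the fresh ids into an existing url one parameter at a time. The reason for going pair by pair is that, as far as I can tell from the URI documentation, calling `query_form` with a single key/value pair sets the whole query string to just that pair instead of updating one parameter. The HTML snippet reuses the hidden inputs noted after `__END__` above; the placeholder values `old-session` and `old-engine` and the shortened url are made up for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hidden inputs as noted from the first page (wrapped in a form for the parser).
my $html = <<'HTML';
<form name="dqform">
<input type=hidden name="QRY" value="res">
<input type=hidden name="BV_SessionID" value="@@@@1590200227.1028551469@@@@">
<input type=hidden name="BV_EngineID" value="cccjadcflifjlhlcflgcefkdffndfkh.0">
</form>
HTML

# Collect name => value for every hidden input.
my %hidden;
my $parser = HTML::TokeParser->new(\$html);
while (my $tag = $parser->get_tag('input')) {
    my $attr = $tag->[1];
    next unless lc($attr->{type} // '') eq 'hidden' && defined $attr->{name};
    $hidden{ $attr->{name} } = $attr->{value};
}

# Url with stale placeholder ids (hypothetical values).
my $url = URI->new(
    'http://www.bt.co.uk/directory-enquiries/dq_home.jsp'
  . '?QRY=res&BV_SessionID=old-session&BV_EngineID=old-engine'
  . '&new_search=true&NAM=A*&LOC=London&limit=25&CallingPage=Homepage'
);

# Replace only the BV_* values, leaving every other parameter in place.
my @pairs = $url->query_form;
for (my $i = 0; $i < @pairs; $i += 2) {
    my $name = $pairs[$i];
    $pairs[$i + 1] = $hidden{$name} if $name =~ /^BV_/ && exists $hidden{$name};
}
$url->query_form(@pairs);

my %final = $url->query_form;
print "$url\n";
```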

    ____________________________________________________
    ** The Third rule of perl club is a statement of fact: pod is sexy.