in reply to Re: Fetching Web Pages using
in thread Fetching Web Pages using

Hi again. The web page in question is http://www.bt.co.uk/directory-enquiries/dq_home.jsp

What I'm trying to do is get an approximation of the popularity of a surname in a particular area, and I don't plan to use more than the 10 searches I'm allocated per day - I might even get away with one search per surname. I have the BT CD (95 pounds), but I can't use it for this purpose. I've contacted BT and the webmaster and have had no reply...

Here's the code I'm using...

#!/usr/bin/perl
use lib '/home/baz/public_html';
use strict;
use warnings;

use CGI::Carp "fatalsToBrowser";
use CGI ":all";
use DBI;
use LWP::Simple;
use LWP::UserAgent;
use HTML::TokeParser;
use HTTP::Cookies;
use MyVars qw($footer);

$| = 1;

my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
$ua->cookie_jar(HTTP::Cookies->new(file => "lwpcookies.txt", autosave => 1));

my $query = new CGI;
print $query->header, start_html;

my $req = HTTP::Request->new(GET => "http://www.bt.co.uk/directory-enquiries/dq_home.jsp");
my $res = $ua->request($req);

open(LOG, ">>res.html") or die "Can't open res.html: $!";
print LOG $res->content;
close LOG;

Re: Fetching Web Pages using get
by crenz (Priest) on Aug 02, 2002 at 18:18 UTC

    Okay, I've got a working search. Let me describe what I did; I think it's a generally useful learning experience.

    First, I looked at the web page. I decided not to preoccupy myself with how to fetch it using Perl, but rather to try to submit a search and get some results.

    The source code shows that the form submit is caught by JavaScript and validated. Fair enough. I look out for lines like

    document.dqform.action="/directory-enquiries/dq_locationfinder.jsp";

    and also for submission buttons (there are none) -- and change the action to a test script. In this case, it's my trusty http://www.web42.com/cgi-bin/test.cgi. Nothing special, but effective for this problem.

    I have to admit this is lazy: I make no effort to understand the (hard to read and longish) HTML source, but rather load the page in my browser, enter the desired values, submit it and let my script show what happened ;-). See the result on the results page.

    I create a simple script to submit the form using the above variables. It works, but the HTML page contains a warning that my connection expired. Now, "expired connections" always point to some persistent state, like cookies (which I didn't even enable) -- or session IDs. We have two of these IDs in the variable list of the results mentioned above.
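    A form submission like that can be sketched with HTTP::Request::Common. The field names surname and location below are placeholders, not the real ones; the actual names are whatever the test-script dump showed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

# Build a POST request for the search form. The field names are
# placeholders; substitute the names revealed by the form dump.
my $ua  = LWP::UserAgent->new;
my $req = POST 'http://www.web42.com/cgi-bin/test.cgi',
    [ surname => 'Smith', location => 'London' ];

print $req->method, "\n";     # POST
print $req->content, "\n";    # surname=Smith&location=London

# To actually send it:
# my $res = $ua->request($req);
# print $res->content if $res->is_success;
```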

    So I just insert another request to first fetch the search page. Then I search it for the two IDs and use them to submit the search. Voilà!
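    The ID-scraping step can be sketched like this. The field name sessionid in the example is a guess for illustration; the real names are the two IDs from the variable list above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull hidden-field name/value pairs (e.g. session IDs) out of an
# HTML page. Assumes the common attribute order type/name/value;
# adjust the regex if the real page differs.
sub extract_hidden_fields {
    my ($html) = @_;
    my %fields;
    while ($html =~ /<input[^>]+type="hidden"[^>]+name="([^"]+)"[^>]+value="([^"]*)"/gi) {
        $fields{$1} = $2;
    }
    return %fields;
}

# Hypothetical example input; the field names on the real page differ.
my $html = '<input type="hidden" name="sessionid" value="abc123">';
my %ids  = extract_hidden_fields($html);
print "$ids{sessionid}\n";    # abc123
```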

    Still, there are some caveats. You can play with the limits variable, but there seems to be a cap set by the server (50 results). To get more than that, you'll need to make follow-up requests.
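    If you do need more than the 50-result cap, the follow-up requests are just the same search repeated at increasing offsets. A minimal sketch of the offset arithmetic (how the real form names its offset parameter is unknown; check the form dump):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compute the offsets for paged follow-up requests when the server
# returns at most $per_page results per response.
sub page_offsets {
    my ($total, $per_page) = @_;
    my @offsets;
    for (my $o = 0; $o < $total; $o += $per_page) {
        push @offsets, $o;
    }
    return @offsets;
}

# e.g. 120 expected results, 50 per page -> requests at 0, 50, 100
print join(", ", page_offsets(120, 50)), "\n";    # 0, 50, 100
```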

Re: Fetching Web Pages using get
by Poblachtach32 (Acolyte) on Aug 02, 2002 at 13:32 UTC
    Cool! That's interesting. Let me know when you start getting some results... Oh, and sorry I can't be of any help.
      Any ideas? I set the cookie jar file (lwpcookie.txt) as an empty file (two carriage returns, I think) when starting, and it remains that way.
      pob - you might be interested in this