in reply to Fetching Web Pages using get

I hope what you're trying to do is still ethical...

Some web pages check all kinds of parameters. I once had the problem that I wanted people to be able to send me SMS via e-mail. I used a Perl script that read the e-mail and accessed my phone company's website to fill in a form and submit the SMS. I started out being honest, giving a nice and true user-agent name (with my e-mail address) etc., but made more and more changes until the site accepted my submissions.

Well, to cut this short:

my $agentname = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)';
my $ua = new LWP::UserAgent;
$ua->agent($agentname);
my $request = GET $send_url;
# Fake IE...
$request->header('Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*');
$request->header('Accept-Language' => 'en-us');
$request->header('Referer' => $referer);
my $response = $ua->request($request);

It seems likely that you need to accept cookies as well. Take a look at what you are getting -- maybe you need to re-access the page because you got redirected together with a cookie:

if ($res->is_redirect()) {
    my $loc = $res->header('Location');
    # create new request like above
    # and reaccess site
}
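For the cookie part, LWP can manage cookies automatically if you attach an HTTP::Cookies jar to the user agent. A minimal sketch -- the URL here is just a placeholder:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new;
# An in-memory jar; pass file => '...', autosave => 1 to persist cookies
# between runs of the script.
$ua->cookie_jar(HTTP::Cookies->new);

# Any Set-Cookie headers in the response are now stored in the jar and
# sent back automatically on subsequent requests to the same site.
my $res = $ua->get('http://example.com/search');   # placeholder URL
print $ua->cookie_jar->as_string;
```

With the jar attached you don't have to copy Set-Cookie headers around by hand when you re-request the page after a redirect.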

Fetching Web Pages using get
by Baz (Friar) on Aug 02, 2002 at 12:31 UTC
    Hi again, the web page in question is Here

    What I'm trying to do is get an approximation of the popularity of a surname in a particular area, and I don't plan to use any more than the 10 searches I'm allocated per day - I might even get away with one search per surname. I have the BT CD (95 pounds) but I can't use it for this purpose, and I've contacted BT and the webmaster and I've got no reply...

    Here's the code I'm using...
    #!/usr/bin/perl #-Tw
    use lib '/home/baz/public_html';
    use strict;
    $| = 1;
    use CGI::Carp "fatalsToBrowser";
    use CGI ":all";
    use DBI;
    use LWP::Simple;
    use LWP::UserAgent;
    use HTML::TokeParser;
    use MyVars qw($footer);
    use HTTP::Cookies;

    my $ua = LWP::UserAgent->new;
    $ua->agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
    my $query = new CGI;
    $ua->cookie_jar(HTTP::Cookies->new(file => "lwpcookies.txt", autosave => 1));
    print $query->header, start_html;
    my $req = HTTP::Request->new(GET => "http://www.bt.co.uk/directory-enquiries/dq_home.jsp");
    my $res = $ua->request($req);
    open(LOG, ">>res.html") or die "can't append to res.html: $!";
    print LOG $res->content;   # $res->content, not "$res->content": method calls don't interpolate in strings

      Okay, I've got a working search. Let me describe what I did; I think it is a generally useful learning experience.

      First, I looked at the web page. I decided not to preoccupy myself with how to view it using Perl, but rather to try to submit a search and get some results.

      The source code shows that the form submit is caught by JavaScript and validated. Fair enough. I look out for lines like

      document.dqform.action="/directory-enquiries/dq_locationfinder.jsp";

      and also for submission buttons (there are none) -- and change the action to a test script. In this case, it's my trusty http://www.web42.com/cgi-bin/test.cgi. Nothing special, but effective for this problem.

      I have to admit this is lazy: I make no effort to understand the (hard to read and longish) HTML source, but rather load the page in my browser, enter the desired values, submit it and let my script show what happened ;-). See the result on the results page.

      I create a simple script to submit the form using the above variables. It works, but the HTML page contains a warning that my connection expired. Now, "expired connections" always point to some persistent variables, like cookies (which I didn't even enable) -- or session IDs. We have two of these IDs in the variable list of the results mentioned above.

      So I just insert another request to first fetch the search page. Then I search it for the two IDs and use them to submit the search. Voilà!
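That two-step flow might look roughly like the sketch below. The URLs and the field names `sessionid` and `formid` are made-up placeholders -- substitute whatever hidden fields the real page actually contains:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(GET POST);

my $ua = LWP::UserAgent->new;
$ua->agent('Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)');

# Step 1: fetch the search page and pull the hidden ID fields out of it.
# 'sessionid' and 'formid' are placeholder names for illustration only.
my $page = $ua->request(GET 'http://example.com/search.jsp')->content;
my ($sid) = $page =~ /name="sessionid"\s+value="([^"]*)"/;
my ($fid) = $page =~ /name="formid"\s+value="([^"]*)"/;

# Step 2: submit the search, carrying both IDs along with the form data.
my $res = $ua->request(POST 'http://example.com/results.jsp', [
    sessionid => $sid,
    formid    => $fid,
    surname   => 'Smith',
]);
print $res->content;
```

A regex like this is fine for one known page; for anything less predictable, a proper parser such as HTML::TokeParser (already loaded in the script above) is more robust.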

      Still, there are some caveats. You can play with the limits variable, but there seems to be a limit set by the server (50). To get more than that, you'll need to do follow-up requests.
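Those follow-up requests amount to a generic pagination loop. In this sketch the parameter names `start` and `limit` and the URL are invented for illustration; the real site will use its own names:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Page through the results 50 at a time. 'start' and 'limit' are
# made-up parameter names -- use whatever the site actually expects.
my $limit = 50;
for (my $start = 0; $start < 200; $start += $limit) {
    my $res = $ua->post('http://example.com/results.jsp', {   # placeholder URL
        surname => 'Smith',
        start   => $start,
        limit   => $limit,
    });
    last unless $res->is_success;
    # ... parse $res->content here, and stop early once a page
    # comes back with fewer than $limit results ...
}
```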

      Cool! That's interesting. Let me know when you start getting some results... Oh! And sorry I can't be of any help.
        Any ideas?? - I set the cookie jar file (lwpcookie.txt) as an empty file (2 carriage returns, I think) when starting, and it remains that way.
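One way to narrow that down is to check whether the server is sending any Set-Cookie header at all before suspecting the jar -- if it never sets a cookie, an empty jar is exactly what you'd expect and the session state must live elsewhere (e.g. in hidden form fields). A sketch, with a placeholder URL:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new;
$ua->cookie_jar(HTTP::Cookies->new(file => 'lwpcookies.txt', autosave => 1));

my $res = $ua->get('http://example.com/page.jsp');   # placeholder URL
# In list context, header() returns every Set-Cookie value the server sent.
# If this prints nothing, the server never set a cookie in the first place.
print "Set-Cookie: $_\n" for $res->header('Set-Cookie');
print $ua->cookie_jar->as_string;
```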
        pob - you might be interested in this