How to "fake" web browser from Perl

by Anonymous Monk
on Nov 29, 2003 at 09:14 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, there is an interesting question-answering site at MIT:

http://www.ai.mit.edu/projects/infolab/

It's a simple web page where the user types in a question, presses a button to submit it, and then receives an answer. I'd like to connect to this site from a Perl program, so I wonder how this could be done. I'm a newbie, so some starting advice or a pointer to more info on this would be a great help.

Thanks in advance,

regards,

Robert.

Replies are listed 'Best First'.
Re: How to "fake" web browser from Perl
by Roger (Parson) on Nov 29, 2003 at 11:44 UTC
    I have created a little sample perl script to add to liz's comment above. The sample script demonstrates the use of the WWW::Mechanize module. It fetches the url first, fills in the question, submits the form, gets the answer, and reformats the answer (optionally) to plain text wrapping at column 60.

        use strict;
        use WWW::Mechanize;

        my $url = "http://www.ai.mit.edu/projects/infolab/";
        my $question = 'What is AI';

        my $robot = new WWW::Mechanize;
        $robot->get($url);
        $robot->form_number('1');
        $robot->set_fields('query' => $question);   # ask a question
        $robot->click();

        # Get the reply to my question
        my $html = $robot->content();

        # Extract the answer
        my ($text) = $html =~ /(<H1>START(?:.|\n)*<HR>)/mg;

        # Reformat the text
        $text =~ s/<[^>]*>//g;              # Strip HTML tags
        $text =~ s/(?<!\n)\n(?!\n)/ /mg;    # Combine lines
        $text =~ y/ //s;                    # Squash multiple spaces

        my $len;
        $text =~ s/((\S+)(?=\n|\s)|\n)/     # Reformat plain text
            if ($1 eq "\n") {               # wrapping at col 60
                $len = 0;
                $1
            }
            elsif ($len + length($1) > 60) {
                $len = length($1) + 1;
                "\n$1"
            }
            else {
                $len += length($1) + 1;
                $1
            }
            /mge;

        # after some playing around, I came up with the
        # following regex that does wrapping at column
        # 60 perfectly. I love perl. ;-)
        $text =~ s/(.{50,60}(?<=\s\b))/$1\n/mg;

        print "$text\n";
    And the formatted answer to my question is -
    START's reply ===> What is AI Artificial Intelligence is the study of the computations that make it possible to perceive, reason and act. From the perspective of this definition, artificial intelligence differs from most of psychology because of the greater emphasis on computation, and artificial intelligence differs from most of computer science because of the emphasis on perception, reasoning and action. The central goal of the Artificial Intelligence Laboratory is to develop a computational theory of intelligence extending from the manifestation of behavior to operations at the neural level. Current work focuses especially on understanding the role of vision, language, and motor computation in general intelligence.
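
For readers who would rather not maintain a wrapping regex: the core Text::Wrap module can do the same job. This is a different technique from the regexes in the script above, not a rewrite of them. A minimal sketch, where the helper name wrap_text and the sample sentence are made up for illustration:

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Wrap plain text so that no output line is longer than $width characters.
# wrap_text is a hypothetical helper name, not part of Text::Wrap.
sub wrap_text {
    my ($text, $width) = @_;
    # Text::Wrap emits lines of at most $columns - 1 characters.
    local $Text::Wrap::columns  = $width + 1;
    # Keep spaces as spaces instead of converting runs of them to tabs.
    local $Text::Wrap::unexpand = 0;
    return wrap('', '', $text);
}

# Sample text, invented for this example.
my $sample = join ' ',
    ('Artificial Intelligence is the study of computations') x 4;
print wrap_text($sample, 60), "\n";
```

In the script above this would replace both wrapping substitutions with a single call such as `$text = wrap_text($text, 60);`.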
Re: How to "fake" web browser from Perl
by liz (Monsignor) on Nov 29, 2003 at 09:58 UTC
    The answer is really simple: have a look at the LWP::UserAgent module. That should get you started. Then, maybe later, you might want to have a look at WWW::Mechanize.
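
As a rough idea of what the LWP route looks like, here is a minimal sketch that builds a form submission with HTTP::Request::Common. The action URL is a placeholder (the real form target is not given in this thread), and the field name 'query' is borrowed from Roger's WWW::Mechanize script, so treat both as assumptions:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

# Build (but do not send) a form submission.
# build_question_request is a hypothetical helper; $action_url is whatever
# the form's ACTION attribute points at on the real page.
sub build_question_request {
    my ($action_url, $question) = @_;
    return POST $action_url, [ query => $question ];
}

# Placeholder URL, for illustration only.
my $req = build_question_request('http://www.example.com/ask', 'What is AI');
print $req->as_string;

# To actually send it (requires network access):
# my $ua  = LWP::UserAgent->new;
# my $res = $ua->request($req);
# print $res->is_success ? $res->content : $res->status_line, "\n";
```

Printing the request before sending it is a cheap way to check that the field names and encoding match what the browser would submit.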

    "A fool can ask more questions than a wise (wo)man can answer"

    So please, don't overdo it with your questions from your Perl program... ;-)

    Liz

Re: How to "fake" web browser from Perl (and I mean /really/ fake)
by grinder (Bishop) on Nov 29, 2003 at 16:22 UTC

    The following isn't really the answer to your question, but when I read the title, I thought of something else. I might as well reply, since you never know who will hit this page from a search engine, and maybe this will be useful.

    You see, to request a page from a web site, at the bare minimum you have to open a socket to the port 80 of the remote machine, and send something like:

    GET / HTTP/1.0

    ... with an extra newline to tell the remote server you aren't qualifying the query with other information (and I'm glossing over the definition of a newline...). Still, that is sufficient for a basic page from a basic server.
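
A minimal sketch of that bare-bones request using only the core IO::Socket::INET module. The helper names and the example host are made up for illustration. On the wire, the "newline" that HTTP expects is CRLF, which is why the request string ends in "\r\n\r\n":

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Build the smallest useful HTTP/1.0 request: a request line
# followed by an empty line. HTTP line endings are CRLF.
sub build_request {
    my ($path) = @_;
    return "GET $path HTTP/1.0\r\n\r\n";
}

# Open a TCP connection to port 80, send the request, and return the raw
# response (status line, headers and body). Requires network access.
sub raw_get {
    my ($host, $path) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
    ) or die "cannot connect to $host: $@\n";
    print {$sock} build_request($path);
    local $/;                    # slurp until the server closes the connection
    my $response = <$sock>;
    close $sock;
    return $response;
}

# Example with a placeholder host:
# print raw_get('www.example.com', '/');
print build_request('/');
```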

    That said, there are times when you come across a server and this is not enough. Maybe it insists on a particular version of Microsoft IE or Netscape Navigator (these days, that's getting rarer). Or some other piece of information, because the server is trying to distinguish between programs (such as those that one might write in Perl), and humans sitting behind browsers clicking on buttons.

    When this happens, you really do have to "fake" a web browser in Perl. To do so, you have to send more information along with your request, which hopefully will slide under the radar, and the server will think it's talking to just another user, clicking away in a browser.

    The last time I had to do this, according to the date of the script, was 1999-11-04. I no longer recall what I needed this for, but I did name the script sneakyget :)

        #! /usr/local/bin/perl -w

        use strict;
        use HTTP::Request;
        use LWP::UserAgent;

        $|++;

        my $URL = shift or die "no url on command line\n";
        my $ua = new LWP::UserAgent;
        $ua->agent('Mozilla/4.7 [en] (Win95; I)');

        my $r = new HTTP::Request;
        $r->header(
            Accept          => [qw{image/gif image/x-xbitmap image/jpeg image/pjpeg image/png */*}],
            Accept_Charset  => [qw{iso-8859-1 * utf-8}],
            Accept_Encoding => 'gzip',
            Accept_Language => 'en',
            Connection      => 'Keep-Alive',
        );
        $r->method( 'GET' );
        $r->uri( $URL );

        my $res = $ua->request( $r );
        print $res->content;
        warn $res->code;

    This was sufficient at the time for my nefarious purposes. Of course, these days one might have to update it a little with a more current OS and browser. The main point is that you can indeed "fake" a web browser with Perl.

    To find out what a browser sends to a server in its headers along with the GET/POST/whatever request, the following CGI script can come in handy. It just echoes back the information the CGI environment has at its disposal.

        #! /usr/local/bin/perl -w

        use strict;
        use CGI;

        my $q = new CGI;

        print
            $q->header(),
            $q->start_html( 'session echo' ),
            $q->h1( 'session echo' ),
            $q->table(
                $q->TR( { -valign=>'top' },
                    [map { $q->th( {-align=>'right'}, $_ ) . $q->td( $ENV{$_} ) } sort keys %ENV]
                )
            ),
            $q->end_html();

    With a bit of experimentation you can tell what different browsers send. I used something like this at the time to build the above script.

    If you need to play around with a functional implementation of this CGI script, you can try it out here on jcwren's perlmonk server.

    Finally, for reference (I just might come back here myself some day), the two main RFCs covering HTTP are RFC 1945 for version 1.0 and RFC 2616 for 1.1. Have fun.

      [...] because the server is trying to distinguish between programs (such as those that one might write in Perl), and humans sitting behind browsers clicking on buttons.
      Some sites do this with Javascript, with cookies, by using Javascript to set cookies, or by setting cookies from another resource that a browser would also load (typically images, such as ads). The aim is to make the screenscraper's work as close to impossible as they can.

      They also often check the REFERER header.
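
Two of these checks are easy to satisfy from LWP: keeping cookies between requests and sending a Referer header. A minimal sketch, where the URLs and the helper name request_with_referer are placeholders for illustration. Cookies that are set by Javascript are a different matter, since LWP does not run scripts:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request;

# A user agent that remembers cookies between requests (in memory only).
my $ua = LWP::UserAgent->new;
$ua->cookie_jar( HTTP::Cookies->new );

# Build a GET request that claims to have been reached from $referer.
# request_with_referer is a hypothetical helper name.
sub request_with_referer {
    my ($url, $referer) = @_;
    my $req = HTTP::Request->new( GET => $url );
    $req->header( Referer => $referer );
    return $req;
}

# Placeholder URLs, for illustration only.
my $req = request_with_referer(
    'http://www.example.com/answer',
    'http://www.example.com/',
);
print $req->as_string;

# Sending it requires network access; any cookies in the reply go into
# the jar and are sent back automatically on later requests:
# my $res = $ua->request($req);
```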

      You can also open up a real browser and drive it in code.

      The last time that I really needed to do that was on Windows and I used OLE to drive IE. I haven't investigated how to do it recently, but I would be amazed if you couldn't script Mozilla in some more portable way if you wanted.

Re: How to "fake" web browser from Perl
by xenchu (Friar) on Nov 29, 2003 at 14:25 UTC

    It's simple web page where user types in question, presses button to submit and then receives answer.

    Aren't you using a better system right now? I mean you enter your question as you have done here and get multiple answers (usually). I realize it is not as fast as the MIT system but you get knowledgeable information tailored to your question instead of a canned answer. You also have Q&A, Meditations, Perl Discussions, Library, etc. So why bother with an inferior system?

    xenchu

    Perl has one Great Advantage and one Great Disadvantage:

    It is easy to write a complex and powerful program in three lines of code.
      I think maybe you are missing the point xenchu. MIT's START project has nothing to do with PerlMonks other than some anonymous person wishes to use START via Perl. START is a "natural language question answer system" whereas PM is a community of Perl enthusiasts. The only real similarity is that both deal with questions. START is, from what I understand, a very mature project, and calling it an "inferior system" compared to PM is (kinda) like calling a bicycle an inferior car; bicycles and cars both provide transportation, but they have two completely different uses.

      (It's quite possible I misunderstood what you were saying. If so, please correct me.)

        I miss a lot of points, Mad Hatter, so I may have been completely off-base here. My point, though, was that however good the MIT system is (I have never seen or used it), IMHO information, especially about Perl, is going to be better from a system dedicated to Perl than from any natural language system. To me START is still canned information and not as useful or as up-to-date as answers from real live people.

        As for the 'inferior system' crack, that was simply sarcasm that obviously fell flat. I apologize to anyone who was offended in any way.

        xenchu

        Perl has one Great Advantage and one Great Disadvantage:

        It is very easy to write a complex and powerful program in three lines of code.
        I asked START a number of questions and the following are what I got:

        1) What is the size of Singapore?
        ->The population of Singapore, Singapore is 2,792,000.
        ( I was expecting the physical size of Singapore... )

        1a) What is the physical size of Singapore?
        ->I don't have this information.
        ( Huh? )

        1b) How big is Singapore?
        ->Singapore is located at 3 feet above sea level.
        ( Hm... )

        2) Who is the prime minister of Singapore?
        ->chief of state: President Sellapan Rama (S. R.) NATHAN (since 1 September 1999) head of government: Prime Minister GOH Chok Tong (since 28 November 1990) and Deputy Prime Ministers Brig. Gen. (Ret.) LEE Hsien Loong (since 28 November 1990) and TAN Keng Yam Tony (since 1 August 1995) cabinet: Cabinet appointed by the president, responsible to Parliament elections: president elected by popular vote for a six-year term; election last held 28 August 1999 (next to be held by August 2005); following legislative elections, the leader of the majority party or the leader of a majority coalition is usually appointed prime minister by the president; deputy prime ministers appointed by the president election results: Sellapan Rama (S. R.) NATHAN elected president unopposed
        ( A simple "Goh Chok Tong" would have been greatly appreciated. )

        3) How many states are there in America?
        ->50 states and 1 district*; Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, District of Columbia*, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming
        ( That looks okay... )

Re: How to "fake" web browser from Perl
by Arbogast (Monk) on Nov 30, 2003 at 14:04 UTC
    O'Reilly's Perl and LWP (Library for WWW in Perl) probably has a lot of answers to your questions. It's a good book, short and to the point.

    http://www.oreilly.com/catalog/perllwp/
