Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
There is an interesting question-answering site at MIT:
http://www.ai.mit.edu/projects/infolab/
It's a simple web page where the user types in a question, presses a button to submit it, and then receives an answer. I'd like to connect to this site from a Perl program, so I wonder how this could be done. I'm a newbie, so some starting advice or a pointer to more info on this would be a great help.
Thanks in advance,
regards,
Robert.
Re: How to "fake" web browser from Perl
by Roger (Parson) on Nov 29, 2003 at 11:44 UTC
I have created a little sample Perl script to add to liz's comment. It demonstrates the use of the WWW::Mechanize module: it fetches the URL, fills in the question, submits the form, gets the answer, and (optionally) reformats the answer as plain text wrapped at column 60.
use strict;
use WWW::Mechanize;
my $url = "http://www.ai.mit.edu/projects/infolab/";
my $question = 'What is AI';
my $robot = WWW::Mechanize->new;
$robot->get($url);
$robot->form_number(1);
$robot->set_fields('query' => $question); # ask a question
$robot->click();
# Get the reply to my question
my $html = $robot->content();
# Extract the answer
my ($text) = $html =~ /(<H1>START(?:.|\n)*<HR>)/mg;
die "Could not find START's answer in the reply\n" unless defined $text;
# Reformat the text
$text =~ s/<[^>]*>//g; # Strip HTML tags
$text =~ s/(?<!\n)\n(?!\n)/ /mg; # Combine lines
$text =~ y/ //s; # Squash multiple spaces
my $len;
$text =~ s/((\S+)(?=\n|\s)|\n)/ # Reformat plain text
    if ($1 eq "\n") { # wrapping at col 60
        $len = 0;
        $1
    } elsif ($len + length($1) > 60) {
        $len = length($1) + 1;
        "\n$1"
    } else {
        $len += length($1) + 1;
        $1
    }
/mge;
# after some playing around, I came up with the
# following regex that does wrapping at column
# 60 perfectly. I love perl. ;-)
$text =~ s/(.{50,60}(?<=\s\b))/$1\n/mg;
print "$text\n";
And the formatted answer to my question is -
START's reply
===> What is AI
Artificial Intelligence is the study of the computations
that make it possible to perceive, reason and act.
From the perspective of this definition, artificial
intelligence differs from most of psychology because of the
greater emphasis on computation, and artificial intelligence
differs from most of computer science because of the
emphasis on perception, reasoning and action.
The central goal of the Artificial Intelligence Laboratory
is to develop a computational theory of intelligence
extending from the manifestation of behavior to operations
at the neural level. Current work focuses especially on
understanding the role of vision, language, and motor
computation in general intelligence.
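For comparison, the core Text::Wrap module can also handle the column-60 wrapping. What follows is only a minimal sketch and not part of the script above; the helper name wrap_at and the sample string are made up for illustration. Text::Wrap keeps every line shorter than $Text::Wrap::columns, so the sketch sets that to one more than the desired width.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical helper: wrap $text so that no line exceeds $width characters.
sub wrap_at {
    my ($text, $width) = @_;
    # Text::Wrap produces lines of at most $columns - 1 characters.
    local $Text::Wrap::columns = $width + 1;
    return wrap('', '', $text);
}

# Made-up sample text standing in for the tag-stripped reply.
my $sample = 'Artificial Intelligence is the study of the computations '
           . 'that make it possible to perceive, reason and act.';
print wrap_at($sample, 60), "\n";
```

The two empty strings passed to wrap() are the indent for the first line and for the following lines.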
Re: How to "fake" web browser from Perl
by liz (Monsignor) on Nov 29, 2003 at 09:58 UTC
The answer is really simple: have a look at the LWP::UserAgent module. That should get you started. Then, maybe later, you might want to have a look at WWW::Mechanize.
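As a rough idea of what that might look like with LWP::UserAgent, here is a minimal sketch. It assumes the form is submitted by POST to the page URL itself and that the text field is called query (the field name is the one used in Roger's script in this thread; the POST target is a guess, so check the page's form tag first). The helper name ask_start is hypothetical.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: submit one question and return the HTML of the reply.
# Assumes the form accepts a POST at $url with a field named 'query'.
sub ask_start {
    my ($ua, $url, $question) = @_;
    my $res = $ua->post($url, { query => $question });
    die 'Request failed: ' . $res->status_line . "\n" unless $res->is_success;
    return $res->decoded_content;
}

# Only go out to the network when a question is given on the command line.
if (@ARGV) {
    my $ua = LWP::UserAgent->new(timeout => 30);
    print ask_start($ua, 'http://www.ai.mit.edu/projects/infolab/', "@ARGV");
}
```

Saved under a name of your choosing, it would be run with the question as its arguments; the result is raw HTML that still needs the answer extracted from it.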
"A fool can ask more questions than a wise (wo)man can answer"
So please, don't overdo it with your questions from your Perl program... ;-)
Liz
Re: How to "fake" web browser from Perl (and I mean /really/ fake)
by grinder (Bishop) on Nov 29, 2003 at 16:22 UTC
The following isn't really the answer to your question, but when I read the title, I thought of something else. I might as well reply, since you never know who will hit this page from a search engine, and maybe this will be useful.
You see, to request a page from a web site, at the bare minimum you have to open a socket to port 80 of the remote machine, and send something like:
GET / HTTP/1.0
... with an extra newline to tell the remote server you aren't qualifying the query with other information (and I'm glossing over the definition of a newline...). Still, that is sufficient for a basic page from a basic server.
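To make that bare-minimum request concrete, here is a rough sketch using the core IO::Socket::INET module. The helper names build_request and raw_get are made up for illustration. It uses CRLF line endings, which is what the HTTP specification calls for; the empty line that ends the request is the "extra newline" mentioned above.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Hypothetical helper: the smallest HTTP/1.0 request for $path.
# Each line ends in CRLF; the final empty line ends the request.
sub build_request {
    my ($path) = @_;
    return "GET $path HTTP/1.0\r\n\r\n";
}

# Hypothetical helper: send the request to port 80 of $host and
# return everything the server sends back (status line, headers, body).
sub raw_get {
    my ($host, $path) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
    ) or die "Cannot connect to $host: $@\n";
    print {$sock} build_request($path);
    local $/;                       # slurp mode: read until the server closes
    my $reply = <$sock>;
    close $sock;
    return $reply;
}

# Only touch the network when a host name is given on the command line.
print raw_get($ARGV[0], $ARGV[1] // '/') if @ARGV;
```

Many servers today also expect a Host: header, which HTTP/1.0 does not require, so a request this small will not work everywhere.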
That said, there are times when you come across a server and this is not enough. Maybe it insists on a particular version of Microsoft IE or Netscape Navigator (these days, that's getting rarer). Or some other piece of information, because the server is trying to distinguish between programs (such as those that one might write in Perl), and humans sitting behind browsers clicking on buttons.
When this happens, you really do have to "fake" a web browser in Perl. To do so, you have to send more information along with your request, which hopefully will slide under the radar, and the server will think it's talking to just another user, clicking away in a browser.
The last time I had to do this, according to the date of the script, was 1999-11-04. I no longer recall what I needed this for, but I did name the script sneakyget :)
#! /usr/local/bin/perl -w
use strict;
use HTTP::Request;
use LWP::UserAgent;
$|++;
my $URL = shift or die "no url on command line\n";
my $ua = new LWP::UserAgent;
$ua->agent('Mozilla/4.7 [en] (Win95; I)');
my $r = new HTTP::Request;
$r->header(
    Accept          => [qw{image/gif image/x-xbitmap image/jpeg image/pjpeg image/png */*}],
    Accept_Charset  => [qw{iso-8859-1 * utf-8}],
    Accept_Encoding => 'gzip',
    Accept_Language => 'en',
    Connection      => 'Keep-Alive',
);
$r->method( 'GET' );
$r->uri( $URL );
my $res = $ua->request( $r );
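# We asked for gzip above, so undo any content encoding before printing
print $res->decoded_content( charset => 'none' );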
warn $res->code;
This was sufficient at the time for my nefarious purposes. Of course, these days one might have to update it a little with a more current OS and browser. The main point is that you can indeed "fake" a web browser with Perl.
To find out what a browser sends to a server in its headers along with the GET/POST/whatever request, the following CGI script can come in handy. It just echoes back the information the CGI environment has at its disposal.
#! /usr/local/bin/perl -w
use strict;
use CGI;
my $q = new CGI;
print $q->header(),
    $q->start_html( 'session echo' ),
    $q->h1( 'session echo' ),
    $q->table(
        $q->TR( { -valign=>'top' },
            [map { $q->th( {-align=>'right'}, $_ ) . $q->td( $ENV{$_} ) } sort keys %ENV]
        )
    ),
    $q->end_html();
With a bit of experimentation you can tell what different browsers send. I used something like this at the time to build the above script.
If you need to play around with a functional implementation of this CGI script, you can try it out here on jcwren's perlmonk server.
Finally, for reference (I just might come back here myself some day), the two main RFCs covering HTTP are RFC 1945 for version 1.0 and RFC 2616 for 1.1. Have fun.
Re: How to "fake" web browser from Perl
by xenchu (Friar) on Nov 29, 2003 at 14:25 UTC
It's simple web page where user types in question, presses button to submit and then receives answer.
Aren't you using a better system right now? I mean you enter your question as you have done here and get multiple answers (usually). I realize it is not as fast as the MIT system but you get knowledgeable information tailored to your question instead of a canned answer. You also have Q&A, Meditations, Perl Discussions, Library, etc. So why bother with an inferior system?
xenchu
Perl has one Great Advantage and one Great Disadvantage:
It is easy to write a complex and powerful program in three lines of code.
I miss a lot of points, Mad Hatter, so I may have been completely off-base here. My point, though, was that however good the MIT system is (I have never seen or used it), IMHO information, especially about Perl, is going to be better from a system dedicated to Perl than from any Natural Language system. To me START is still canned information and not as useful or as up-to-date as answers from real live people.
As for the 'inferior system' crack, that was simply sarcasm that obviously fell flat. I apologize to anyone who was offended in any way.
xenchu
Perl has one Great Advantage and one Great Disadvantage:
It is very easy to write a complex and powerful program in three lines of code.
I asked START a number of questions, and here is what I got:
1) What is the size of Singapore?
->The population of Singapore, Singapore is 2,792,000. ( I was expecting the physical size of Singapore... )
1a) What is the physical size of Singapore?
->I don't have this information. ( Huh? )
1b) How big is Singapore?
->Singapore is located at 3 feet above sea level. ( Hm... )
2) Who is the prime minister of Singapore?
->chief of state: President Sellapan Rama (S. R.) NATHAN (since 1 September 1999)
head of government: Prime Minister GOH Chok Tong (since 28 November 1990) and Deputy Prime Ministers Brig. Gen. (Ret.) LEE Hsien Loong (since 28 November 1990) and TAN Keng Yam Tony (since 1 August 1995)
cabinet: Cabinet appointed by the president, responsible to Parliament
elections: president elected by popular vote for a six-year term; election last held 28 August 1999 (next to be held by August 2005); following legislative elections, the leader of the majority party or the leader of a majority coalition is usually appointed prime minister by the president; deputy prime ministers appointed by the president
election results: Sellapan Rama (S. R.) NATHAN elected president unopposed ( A simple "Goh Chok Tong" would have been greatly appreciated. )
3) How many states are there in America?
->50 states and 1 district*; Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, District of Columbia*, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming ( That looks okay... )
Re: How to "fake" web browser from Perl
by Arbogast (Monk) on Nov 30, 2003 at 14:04 UTC
O'Reilly's Perl and LWP (Library for WWW in Perl) probably has a lot of answers to your questions. It's a good book, short and to the point.
http://www.oreilly.com/catalog/perllwp/