hitheone has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, can anyone help me with this problem? I use LWP::Simple to fetch web data. I can get a response for simple pages, but not for more complex ones. My simple program is:

#!/usr/bin/perl
use LWP::Simple;
use HTML::Parse;

$ENV{"SYSTEMROOT"} = "C:\\windows";
$html = get("http://www.google.com");
$text = parse_html($html)->format;
print $text;

(If I replace the URL with "http://www.scholar.google.com/scholar?hl=en&lr=&q=machine+learning", the result is blank.)
Thanks in advance
TD

Janitored by holli - added code tags

Replies are listed 'Best First'.
Re: Cannot retrieve HTML for some pages with LWP
by marnanel (Beadle) on May 27, 2005 at 17:05 UTC
    Retrieving that URL gives you a 403 Forbidden error, with an error page that points you at http://www.google.com/terms_of_service.html . This is in place because Google bars automated querying of its site. LWP::Simple's get function gives you no way to see the response code, so you couldn't tell that the request failed. (If you want that information, use LWP::UserAgent instead.) The function simply returns an empty result, as you saw.
      Thanks for your reply. I have the same problem with LWP::UserAgent. I understand the problem now. However, how can I retrieve web data the way a browser does, i.e., work around the site's blocking of automated access?

        Firstly, please be aware of the issues surrounding accessing Google's site in contravention of their terms of service.

        It might be easier for you to use Google's own web APIs, assuming they cover Google Scholar. Look at Net::Google for examples that use ordinary Google search.

        If you still want to scrape Google Scholar, you may have some luck adapting WWW::Scraper::Google.

Re: Cannot retrieve HTML for some pages with LWP
by Thelonius (Priest) on May 27, 2005 at 17:26 UTC
    It's not much harder to use LWP::UserAgent, and with it you can see the response status when a request fails.
    #!/usr/bin/perl
    use strict;
    use LWP::UserAgent;

    my $url = "http://scholar.google.com/scholar?hl=en&lr=&q=machine+learning";
    my $ua = LWP::UserAgent->new;
    $ua->env_proxy;
    $ua->agent("Mozilla/5.0 (Windows)");
    my $response = $ua->get($url);
    if ($response->is_success) {
        print $response->content;
    } else {
        die $response->status_line;
    }
    However, you may be interested in the Google web APIs, for which there are modules (Net::Google and DBD::Google) on CPAN.

    Also, if you are interested in just getting the text of a web page, you may find it easier to use "lynx -dump" than Perl. You can run it under Cygwin on Windows.
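    As a minimal sketch of that approach (assuming lynx is installed, e.g. via Cygwin; the output filename is just an example):

    ```shell
    # Render the page to plain text and save it; lynx strips the
    # HTML markup for you, much like parse_html(...)->format does.
    lynx -dump "http://www.google.com" > page.txt
    ```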

Re: Cannot retrieve HTML for some pages with LWP
by johnnywang (Priest) on May 27, 2005 at 17:03 UTC
    That URL gives a 302 redirect, I assume LWP::Simple doesn't follow redirects.
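    One way to check what is actually happening with redirects is to walk the chain that LWP::UserAgent records (a sketch, assuming its default redirect handling; note this makes a live request, and each response links to the one before it via previous()):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my $response = $ua->get("http://www.google.com/");

    # Walk back through any redirect chain, printing each hop's
    # request URL and status line (e.g. "302 Found", "200 OK").
    for (my $r = $response; defined $r; $r = $r->previous) {
        print $r->request->uri, " => ", $r->status_line, "\n";
    }
    ```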
Re: Cannot retrieve HTML for some pages with LWP
by djohnston (Monk) on May 27, 2005 at 17:55 UTC