hitheone has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, can anyone help me with this problem? I use LWP::Simple to fetch web data. I can get a response for simple pages, but not for more complex ones. My simple program is:

#!/usr/bin/perl
use LWP::Simple;
use HTML::Parse;

$ENV{"SYSTEMROOT"} = "C:\\windows";
$html = get("http://www.google.com");
$text = parse_html($html)->format;
print $text;

(If I replace the URL with "http://www.scholar.google.com/scholar?hl=en&lr=&q=machine+learning", the result is blank.)
Thanks in advance
TD

Janitored by holli - added code tags

Replies are listed 'Best First'.
Re: Cannot retrieve HTML for some pages with LWP
by marnanel (Beadle) on May 27, 2005 at 17:05 UTC
    Retrieving that URL gives you a 403 Forbidden error, with an error page that points you at http://www.google.com/terms_of_service.html . This is in place because Google bars automated querying of its site. LWP::Simple's get function gives you no way to see the response code, so you couldn't tell that the request failed. (If you want that information, use LWP::UserAgent instead.) The function simply returns an empty result, as you saw.
      Thanks for your reply. I have the same problem with LWP::UserAgent. I understand the problem now. However, how can I retrieve web data the way a browser does, i.e., work around the site's blocking of automated access?

        Firstly, please be aware of the issues surrounding accessing Google's site in contravention of their terms of service.

        It might be easier for you to use Google's own web APIs, assuming they cover Google Scholar. Look at Net::Google for examples that use ordinary Google search.

        If you still want to scrape Google Scholar, you may have some luck adapting WWW::Scraper::Google.

Re: Cannot retrieve HTML for some pages with LWP
by Thelonius (Priest) on May 27, 2005 at 17:26 UTC
    It's not much harder to use LWP::UserAgent, and with it you can see the response status when a request fails.
    #!/usr/bin/perl
    use strict;
    use LWP::UserAgent;

    my $url = "http://scholar.google.com/scholar?hl=en&lr=&q=machine+learning";
    my $ua = LWP::UserAgent->new;
    $ua->env_proxy;
    $ua->agent("Mozilla/5.0 (Windows)");
    my $response = $ua->get($url);
    if ($response->is_success) {
        print $response->content;
    } else {
        die $response->status_line;
    }
    However, you may be interested in the Google web APIs, for which there are modules (Net::Google and DBD::Google) on CPAN.

    Also, if you are interested in just getting the text of a web page, you may find it easier to use "lynx -dump" than Perl. You can run it under Cygwin on Windows.
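    As a minimal sketch of that approach (assuming lynx is installed, e.g. via Cygwin; the output filename is just an example):

    ```shell
    # Render the page to plain text and save it; lynx strips the
    # HTML markup for you, much like parse_html(...)->format does.
    lynx -dump "http://www.google.com" > page.txt
    ```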

Re: Cannot retrieve HTML for some pages with LWP
by johnnywang (Priest) on May 27, 2005 at 17:03 UTC
    That URL gives a 302 redirect, I assume LWP::Simple doesn't follow redirects.
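    One way to check what is actually happening with redirects is to walk the chain that LWP::UserAgent records (a sketch, assuming its default redirect handling; note this makes a live request, and each response links to the one before it via previous()):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my $response = $ua->get("http://www.google.com/");

    # Walk back through any redirect chain, printing each hop's
    # request URL and status line (e.g. "302 Found", "200 OK").
    for (my $r = $response; defined $r; $r = $r->previous) {
        print $r->request->uri, " => ", $r->status_line, "\n";
    }
    ```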
Re: Cannot retrieve HTML for some pages with LWP
by djohnston (Monk) on May 27, 2005 at 17:55 UTC