in reply to LWP::Simple... Enough for Site Query & Data Download?

Corion was right earlier. WWW::Mechanize is the easiest interface for what you want to do: submitting forms programmatically, as long as they do not depend on JavaScript. Make sure the site you are querying allows it. Many (most?) sites prohibit data scraping in their terms of service. Some will make allowances if you ask formally. Some have APIs that give you the data in a robust, correct way.
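For a rough idea of what submitting a form with WWW::Mechanize looks like, here is a minimal sketch. The helper name `submit_first_form`, the field name `station`, its value, and the assumption that the form you want is the first one on the page are all placeholders for illustration; `get`, `submit_form`, and `content` are documented WWW::Mechanize methods.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: fetch $url, fill the first form on the page with
# the fields in %$fields, submit it, and return the resulting HTML.
sub submit_first_form {
    my ($url, $fields) = @_;

    # autocheck => 1 makes Mechanize die on HTTP errors instead of
    # failing silently.
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($url);

    $mech->submit_form(
        form_number => 1,        # assumption: the wanted form is the first one
        fields      => $fields,
    );

    return $mech->content;
}

# Only touches the network when a URL is given on the command line.
if (@ARGV) {
    # 'station' and its value are made-up placeholders.
    my $html = submit_first_form( $ARGV[0], { station => 'example' } );
    print length($html), " bytes returned\n";
}
```

Run without arguments it does nothing; pass the URL of a page you are allowed to query to try it.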

Update: CountZero is also right. You'll want to parse the HTML once the form results come back, with HTML::TokeParser::Simple, HTML::TreeBuilder, or XML::LibXML, for example. If you get stuck on one, come back here.
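As an illustration of the parsing step, here is a small sketch using HTML::TreeBuilder, one of the modules mentioned. The HTML is an invented sample standing in for whatever the form returns; the real page will almost certainly be laid out differently.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Stand-in for the HTML a form submission might return (invented sample).
my $html = <<'HTML';
<html><body>
<table>
  <tr><th>Time</th><th>Temp</th></tr>
  <tr><td>00:00</td><td>61</td></tr>
  <tr><td>01:00</td><td>60</td></tr>
</table>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Collect the text of every <td> cell, one array ref per table row.
my @rows;
for my $tr ( $tree->look_down( _tag => 'tr' ) ) {
    my @cells = map { $_->as_text } $tr->look_down( _tag => 'td' );
    push @rows, \@cells if @cells;   # the header row has only <th>, so it is skipped
}

$tree->delete;   # free the parse tree

# Print each data row as tab-separated values.
print join( "\t", @$_ ), "\n" for @rows;
```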

Re^2: LWP::Simple... Enough for Site Query & Data Download?
by cheech (Beadle) on Jun 15, 2009 at 00:20 UTC
    I used the example code found at the WWW::Mechanize doc to test my wanted page:
    use strict;
    use warnings;
    use WWW::Mechanize;

    my $outfilename = "data.txt";
    open(TFILE,">$outfilename");

    my $mech = WWW::Mechanize->new();
    $mech->get( "http://bub2.meteo.psu.edu/wxstn/wxstn.htm" );
    $mech->forms;

    print TFILE "$mech \n";
    exit;

    I was expecting a list of the available forms on that page, but got the following line of text instead:

    WWW::Mechanize=HASH(0x18454fc)

    Is this because my variable $mech is in scalar context and therefore returned a reference to the array holding the form IDs?

    Thanks

      Go through the docs a little more carefully. You're printing the mech object, not the forms. Try this instead:

      use WWW::Mechanize;
      use YAML ();

      my $mech = WWW::Mechanize->new();
      $mech->get( "http://bub2.meteo.psu.edu/wxstn/wxstn.htm" );

      print YAML::Dump [ $mech->forms ];
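To see the difference between printing an object and printing what one of its methods returns, without any network access, here is a small sketch. `Demo::Agent` is a made-up class standing in for any hash-based object such as the mech object; it is not part of WWW::Mechanize.

```perl
use strict;
use warnings;

# Demo::Agent is a made-up stand-in for any hash-based object.
package Demo::Agent;
sub new   { my ($class) = @_; return bless { forms => [ 'search', 'login' ] }, $class }
sub forms { my ($self)  = @_; return @{ $self->{forms} } }

package main;

my $agent = Demo::Agent->new;

$agent->forms;                  # return value is thrown away (void context)
my $as_string = "$agent";       # interpolating the object gives Class=HASH(0x...)
my @forms     = $agent->forms;  # capturing the returned list is what we want

print "object: $as_string\n";
print "forms:  @forms\n";
```

So context is not the issue: the method call on its own line discards its result, and the string contains the object itself rather than anything the method returned.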

      Next stop: the docs for HTML::Form.
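The objects returned by `$mech->forms` are HTML::Form objects, so the same methods can be tried offline on a snippet of HTML. A minimal sketch, assuming an invented form (the action URL, base URI, and field names are placeholders, not the ones on the page above):

```perl
use strict;
use warnings;
use HTML::Form;

# Invented sample page; the action URL and field names are placeholders.
my $html = <<'HTML';
<form action="/cgi-bin/query" method="post">
  <input type="text" name="station" value="">
  <select name="year"><option>2008</option><option>2009</option></select>
  <input type="submit" name="go" value="Fetch">
</form>
HTML

# In list context parse() returns one HTML::Form object per <form>.
# The second argument is the base URI used to resolve relative actions.
my @forms = HTML::Form->parse( $html, 'http://example.com/' );

# Build a short description of each form and its inputs.
my @summary;
for my $form (@forms) {
    push @summary, sprintf "%s %s", $form->method, $form->action;
    for my $input ( $form->inputs ) {
        push @summary, sprintf "  %-8s %s", $input->type, $input->name // '(unnamed)';
    }
}
print "$_\n" for @summary;
```

With a mech object, the loop body stays the same and `@forms` would come from `$mech->forms` instead of `HTML::Form->parse`.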

        Unfortunately I get the error,
        C:\Perl\scripts>perl -wc foo.pl
        Can't locate YAML.pm in @INC (@INC contains: C:/Perl/site/lib C:/Perl/lib .) at foo.pl line 6.
        BEGIN failed--compilation aborted at foo.pl line 6.
        when I run this code..?
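That message means Perl could not find YAML.pm in any of the directories listed in @INC, i.e. the YAML module is not installed. If installing it from CPAN is not convenient, one possible substitute is Data::Dumper, which ships with Perl. A sketch with an invented data structure standing in for the list of forms:

```perl
use strict;
use warnings;
use Data::Dumper;

# Stand-in for the list returned by $mech->forms (invented data).
my @forms = (
    { method => 'POST', action => '/cgi-bin/query', inputs => [ 'station', 'year' ] },
);

local $Data::Dumper::Sortkeys = 1;   # stable key order
local $Data::Dumper::Indent   = 1;   # compact indentation

# Dumper() returns the structure as a string of Perl code.
my $dump = Dumper( \@forms );
print $dump;
```

In the script above, the equivalent change would be `use Data::Dumper;` in place of `use YAML ();` and `print Dumper [ $mech->forms ];` in place of the `YAML::Dump` line.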