cheech has asked for the wisdom of the Perl Monks concerning the following question:

I'm finally at the point that I can begin building a "first draft" of my summer project. Thanks to everyone who offered suggestions to my posts in the past couple days.

The first step of my project is to write code that will prompt the user for the desired date, and then go to a website that has historical weather data that can be downloaded.

My question is, does LWP::Simple contain all the tools I need to automate visiting a site, sending a query for the data on the submitted date, searching for specific strings, and then downloading the necessary data and writing it to a file?

I can figure out how to get it working IF the package contains everything I need. I just want to make sure I have everything necessary for this to work because I know I won't be able to deduce that on my own in any reasonable amount of time.

Thanks again

Replies are listed 'Best First'.
Re: LWP::Simple... Enough for Site Query & Data Download?
by Your Mother (Archbishop) on Jun 14, 2009 at 22:51 UTC

    Corion was right earlier. WWW::Mechanize is the easiest interface to the things you want to do: submit (no-JavaScript) forms programmatically. Make sure the site you are querying allows it. Many (most?) sites do not permit any data scraping in their terms of service. Some will make allowances if you ask formally. Some have APIs to get the data in a robust/correct way.

    update: CountZero's also right. You'll want HTML parsing after the form results return. HTML::TokeParser::Simple, HTML::TreeBuilder, or XML::LibXML for example. If you get stuck on one come back here.
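A minimal sketch of the form submission described above, assuming the page's first form has a text field called `date`. Both the form number and the field name are hypothetical; the real ones have to be read off the page. The script only contacts the site when a date is given on the command line.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Assumptions: the date box lives in the first form on the page and is
# named 'date'. Inspect the real page to find the actual values.
my $url        = 'http://bub2.meteo.psu.edu/wxstn/wxstn.htm';
my $date_field = 'date';

# Load the page, fill in the date field, submit, and return the result HTML.
sub fetch_for_date {
    my ($mech, $date) = @_;
    $mech->get($url);
    $mech->submit_form(
        form_number => 1,
        fields      => { $date_field => $date },
    );
    return $mech->content;
}

# Only touch the network when a date was passed on the command line.
if (@ARGV) {
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    my $html = fetch_for_date($mech, $ARGV[0]);

    open my $out, '>', 'data.html' or die "Cannot write data.html: $!";
    print {$out} $html;
    close $out or die "Cannot close data.html: $!";
}
```

With `autocheck` enabled, a failed request dies with an error message instead of silently returning an error page.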

      I used the example code found at the WWW::Mechanize doc to test my wanted page:
      use strict;
      use warnings;
      use WWW::Mechanize;
      my $outfilename = "data.txt";
      open(TFILE,">$outfilename");
      my $mech = WWW::Mechanize->new();
      $mech->get( "http://bub2.meteo.psu.edu/wxstn/wxstn.htm" );
      $mech->forms;
      print TFILE "$mech \n";
      exit;

      I was expecting a list of the available forms on that page, but got the following line of text instead:

      WWW::Mechanize=HASH(0x18454fc)

      Is this because my variable $mech is in scalar context and therefore returned a reference to the array holding the form id's?

      Thanks

        Go through the docs a little more carefully. You're printing the mech object, not the forms. Try this instead-

        use WWW::Mechanize;
        use YAML ();
        my $mech = WWW::Mechanize->new();
        $mech->get( "http://bub2.meteo.psu.edu/wxstn/wxstn.htm" );
        print YAML::Dump [ $mech->forms ];

        Next stop: the docs for HTML::Form.
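As a rough illustration of what HTML::Form offers, the sketch below parses a made-up form from a string, lists its inputs, fills one in, and builds the request that a click on Submit would send. The markup, the field name `date`, and the example.com URL are all invented for the example; the real page will differ.

```perl
use strict;
use warnings;
use HTML::Form;

# Made-up markup standing in for the real page (an assumption).
my $html = <<'HTML';
<form name="wx" method="post" action="/cgi-bin/lookup">
  <input type="text" name="date" value="">
  <input type="submit" name="go" value="Submit">
</form>
HTML

# The second argument is the base URL used to resolve the form's action.
my @forms = HTML::Form->parse($html, 'http://example.com/');

# Collect a short description of every form and each of its inputs.
my @report;
for my $form (@forms) {
    push @report, sprintf 'form: %s %s', $form->method, $form->action;
    for my $input ($form->inputs) {
        my $name = defined $input->name ? $input->name : '(none)';
        push @report, sprintf '  input: name=%s type=%s', $name, $input->type;
    }
}
print "$_\n" for @report;

# Fill in the text box and build (but do not send) the resulting request.
$forms[0]->value( date => '2009-06-14' );
my $request = $forms[0]->click;
print $request->method, ' ', $request->uri, "\n";
```

The object returned by `click` is an HTTP::Request, which can be handed to an LWP::UserAgent's `request` method to actually submit the form.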

Re: LWP::Simple... Enough for Site Query & Data Download?
by JavaFan (Canon) on Jun 14, 2009 at 20:48 UTC
    My question is, does LWP::Simple contain all the tools I need to automate visiting a site, sending a query for the data on the submitted date, searching for specific strings, and then downloading the necessary data and writing it to a file?
    Well, if you mean LWP::Simple + Perl, then the answer is yes.

    Does that mean life suddenly becomes easy if you use LWP::Simple, and that no other package would suit your problem far better? That's a question that cannot be answered. You give so little information that if you had phrased your question slightly differently (for instance, "Would LWP::Simple be a logical choice?"), it wouldn't have been answerable. As it stands, the answer is "yes, but you may still have lots to do yourself".
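To give a sense of what "LWP::Simple + Perl" can look like, here is a small sketch: LWP::Simple fetches the page, and plain Perl does the searching and the file writing. The search strings `Temperature` and `Pressure` and the output file name are placeholders, not taken from the actual site. The script fetches only when a URL is given on the command line.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Return the lines of $content that contain any of the given literal strings.
sub matching_lines {
    my ($content, @needles) = @_;
    return unless @needles;
    my $pattern = join '|', map { quotemeta } @needles;
    return grep { /$pattern/ } split /\n/, $content;
}

# Only fetch when a URL was passed on the command line.
if (@ARGV) {
    my $url     = $ARGV[0];
    my $content = get($url);
    die "Could not fetch $url\n" unless defined $content;

    # Placeholder search strings; replace with whatever the page uses.
    my @lines = matching_lines($content, 'Temperature', 'Pressure');

    open my $out, '>', 'matches.txt' or die "Cannot write matches.txt: $!";
    print {$out} "$_\n" for @lines;
    close $out or die "Cannot close matches.txt: $!";
}
```

Everything beyond the `get` call is ordinary Perl, which is the "lots to do yourself" part.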

      I've successfully downloaded and printed the content of the site I need to a file. However, the real information I need is found by going to the site and then typing in the date you want. Originally, I thought each date's data page would have a unique URL so that I could simply getprint the content from each date data page I needed. Unfortunately, these pages do not have unique URLs.

      Is there any way for me to automate inputting each date I need into the text box and then hit Submit to bring up the next page?

        Maybe you want to use the full LWP::UserAgent, or the more browser-like encapsulation of it, WWW::Mechanize? I also recommend reading up on HTTP and how it works, as you'll need a bit of understanding of it if you want to automate websites.

        I've successfully downloaded and printed the content of the site I need to a file. However, the real information I need is found by going to the site and then typing in the date you want.
        I'm confused here. In the first sentence, you claim you've had success, then the second sentence suggests you haven't had success. You can't have it both ways.
        Unfortunately, these pages do not have unique URLs.
        Have you tried doing a GET request with the CGI parameters? Many forms that are set up for POST can actually deal with GET requests as well.
        Is there any way for me to automate inputting each date I need into the text box and then hit Submit to bring up the next page?
        Well, that would be a browser issue. But not being able to do what you want with LWP::Simple doesn't mean the next step up is driving an actual browser. There are other steps as well: LWP::UserAgent and WWW::Mechanize. They are much more suitable for dealing with pure HTML forms than LWP::Simple. Of course, it's also possible that the input of the text box first gets manipulated using JavaScript, or that one or more Ajax calls are involved, in which case LWP::UserAgent or WWW::Mechanize still wouldn't be much of an improvement over LWP::Simple.
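The GET suggestion above amounts to putting the form fields into the URL's query string. A minimal sketch using the URI module follows; the script path and the parameter name `date` are hypothetical, since the real ones depend on the form's `action` attribute and input names.

```perl
use strict;
use warnings;
use URI;

# Build a URL whose query string carries the given form parameters.
# query_form takes care of escaping the names and values.
sub query_url {
    my ($base, %params) = @_;
    my $uri = URI->new($base);
    $uri->query_form(%params);
    return $uri->as_string;
}

# Hypothetical script location and parameter name (assumptions).
my $url = query_url('http://example.com/cgi-bin/lookup', date => '2009-06-14');
print "$url\n";    # http://example.com/cgi-bin/lookup?date=2009-06-14
```

If the server accepts GET for that form, each date then has its own URL, and the result can be passed straight to `get` or `getprint` from LWP::Simple.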
Re: LWP::Simple... Enough for Site Query & Data Download?
by CountZero (Bishop) on Jun 14, 2009 at 22:16 UTC
    It would be wise to invest some time in reading up on the HTML parsers, such as HTML::Parser. Don't even try to find the content you need by using regexes. It will only work for the simplest of cases and is likely to break at the most inopportune moment.
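A short sketch of the parser approach, using HTML::TreeBuilder (which is built on HTML::Parser) to pull the cells out of a table. The table below is invented sample data, not the real site's output.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented sample markup standing in for a results page (an assumption).
my $html = <<'HTML';
<table>
  <tr><th>Time</th><th>Temp</th></tr>
  <tr><td>00:00</td><td>61</td></tr>
  <tr><td>01:00</td><td>60</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Walk every table row and collect the text of its <td> cells.
my @rows;
for my $tr ($tree->look_down(_tag => 'tr')) {
    my @cells = map { $_->as_text } $tr->look_down(_tag => 'td');
    push @rows, \@cells if @cells;    # the header row has no <td> cells
}
$tree->delete;    # free the parse tree

print join("\t", @$_), "\n" for @rows;
```

Because the parser works on the document structure, this keeps working when attributes, whitespace, or line breaks in the markup change, which is where regex-based extraction tends to fail.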

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James