Scythe has asked for the wisdom of the Perl Monks concerning the following question:

Hello Learned Ones,

I've recently started dabbling in Perl, having started up a wee hobby project. It involves logging in to an https site and grabbing a page of data, then parsing out a snippet of info relevant to my needs.

This is all working hunky-dory, with only one small caveat: it uses a lot of data volume. I've set it to grab the page every ten seconds; due to the nature of the data I'm grabbing, this is about the minimum useful update rate. Each time it grabs the page it pulls down about 130-150kB of data, which adds up to significant quantities over any extended period of time, given Australia's archaic volume limits.

What I'm seeking is a way of minimising the amount of data that $mech->get() will grab off a site. I want to disregard images, .css, essentially anything that isn't plaintext data on the site. I've found ways of changing what is provided by $mech->content(), but that just formats the data after it has already been $mech->get()'d.

Your guidance in my time of need would be greatly appreciated.

Re: Conserving bandwidth with WWW::Mechanize's get()
by ikegami (Patriarch) on Jun 05, 2008 at 02:31 UTC
    Just to confirm, WWW::Mechanize ONLY downloads the URL passed to get or specified via other user requests (such as following a link or submitting a form). It does NOT download any embedded objects (images, sounds, flash, etc) or linked documents (style sheets, JavaScript, etc). It DOES follow HTTP redirects and authentication requests (but you can disable that).
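
    If you do want to turn off the automatic redirect handling, something along these lines should do it (a minimal sketch: WWW::Mechanize passes extra constructor options through to LWP::UserAgent, and example.com is just a stand-in for your site):

        use strict;
        use warnings;
        use WWW::Mechanize;

        # An empty requests_redirectable list means 3xx responses are
        # returned as-is instead of being followed automatically.
        my $mech = WWW::Mechanize->new(
            autocheck             => 0,   # don't die on non-2xx responses
            requests_redirectable => [],  # disable automatic redirects
        );

        $mech->get('https://example.com/');
        print $mech->status, "\n";        # a 301/302 will show up here now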
Re: Conserving bandwidth with WWW::Mechanize's get()
by pc88mxer (Vicar) on Jun 05, 2008 at 01:26 UTC
    I'm pretty sure that by default WWW::Mechanize only downloads the web page itself and doesn't download referenced images or style sheets. Is your application a spider, and are you only interested in plain-text-looking documents?
      It's not exactly a spider, in that it only grabs one piece of info from a single page.

      I had assumed that it downloaded the complete package: I monitored the volume used in an hour and divided by (60*6) fetches, giving me a rough stab at the volume per page. I then saved the page to disc from Firefox and the values were roughly the same: 60kB or so of .html file, and 90kB of images and other frippery.

      Is this assumption misguided somehow?

        WWW::Mechanize will automatically handle redirects, but those should be short messages.

        Using LWP::Debug you can get a trace of all the traffic that Mechanize is generating.

        use WWW::Mechanize;
        use LWP::Debug;

        my $mech = WWW::Mechanize->new();
        LWP::Debug::level("+");
        $mech->get("http://www.cnn.com/");
        print length($mech->content), "\n";
Re: Conserving bandwidth with WWW::Mechanize's get()
by Gangabass (Vicar) on Jun 05, 2008 at 04:02 UTC

    WWW::Mechanize only loads a single page. All you can try is WWW::Mechanize::GZip, which tries to load a compressed page (if the server supports it).
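
    If the server does support it, the GZip variant is a drop-in replacement; here's a minimal sketch (assuming WWW::Mechanize::GZip from CPAN is installed, with example.com standing in for your site):

        use strict;
        use warnings;
        use WWW::Mechanize::GZip;   # subclass of WWW::Mechanize

        # Sends "Accept-Encoding: gzip" with each request and transparently
        # decompresses the response, so content() still returns plain HTML.
        my $mech = WWW::Mechanize::GZip->new();
        $mech->get('https://example.com/data.html');

        print length($mech->content), " bytes after decompression\n";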

Re: Conserving bandwidth with WWW::Mechanize's get()
by Scythe (Initiate) on Jun 05, 2008 at 04:44 UTC
    As is so often the case with these things, the mistake was on my end. I'd forgotten that my script grabbed multiple pages per update, so it was, in fact, getting four pages at 35kB each, putting me right on my calculated traffic usage.

    Many thanks for everyone's guidance.

      You're welcome. Something about the Monastery having stone floors to sharpen your wit with :)