water has asked for the wisdom of the Perl Monks concerning the following question:

Hi --

Seeking to avoid re-inventing the wheel: can anyone point me towards anything pre-rolled to grab a web page and return its size in bytes, both the source alone and the total (e.g. including the referenced images)? If I need to roll my own, I'm thinking of a mix of WWW::Mechanize, HTML::Parser, and Image::Size, but I'd rather use something existing, either pure Perl or a Perl wrapper around a Linux command-line app. Thanks for any ideas!

water


Re: determine web page size, w/ and w/o images
by revdiablo (Prior) on Jun 15, 2004 at 03:20 UTC

    You might be able to determine a rough size without having to download all the images. If you download the HTML page, you could then issue a HEAD request for each image it references and add up the reported sizes. Of course, this relies on each web server reporting a proper size, and I don't know how reliable that is, but it seems like it should be pretty good. (Doesn't HTTP use the Content-Length header to determine how much to download? If so, the reported Content-Length would probably be pretty accurate.)
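    For instance, a minimal sketch of that approach with WWW::Mechanize (which you already mentioned), assuming the servers actually send Content-Length:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use WWW::Mechanize;
        use LWP::UserAgent;

        my $url  = shift or die "Usage: $0 URL\n";
        my $mech = WWW::Mechanize->new;
        $mech->get($url);
        die "Can't fetch $url\n" unless $mech->success;

        # Raw byte count of the HTML source.
        my $html_size = length $mech->response->content;
        my $total     = $html_size;

        # Collect the absolute image URLs first, then HEAD each one with
        # a plain user agent so the Mech object's state is left alone.
        my @img_urls = map { $_->url_abs } $mech->images;

        my $ua = LWP::UserAgent->new;
        for my $img_url (@img_urls) {
            my $res = $ua->head($img_url);
            next unless $res->is_success;
            # Trust the server's Content-Length; some servers omit it.
            $total += $res->header('Content-Length') || 0;
        }

        print "HTML source: $html_size bytes\n";
        print "With images: $total bytes\n";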

    HTH

    Update: after re-reading, I'm not sure if I really answered your questions at all, but this is a thought I had. Perhaps it will at least give you some ideas...

Re: determine web page size, w/ and w/o images
by mhi (Friar) on Jun 15, 2004 at 05:08 UTC
Re: determine web page size, w/ and w/o images
by Popcorn Dave (Abbot) on Jun 15, 2004 at 02:51 UTC
    If you had a page that followed the W3C standards, couldn't you grab the page with LWP::Simple, store that to a file, get the size, and use HTML::TokeParser to parse out the image sizes?
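    Something like this minimal sketch covers the fetching half; you don't even need the temp file, since length() on the fetched string gives the byte count directly:

        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use HTML::TokeParser;

        my $url  = shift or die "Usage: $0 URL\n";
        my $html = get($url) or die "Couldn't fetch $url\n";

        # Byte size of the HTML source itself.
        printf "HTML source: %d bytes\n", length $html;

        # Pull the src attribute from each <img> tag; as noted in the
        # update below, width/height attributes are pixels, not bytes,
        # so these URLs still have to be fetched (or HEADed) for sizes.
        my $p = HTML::TokeParser->new( \$html );
        while ( my $tag = $p->get_tag('img') ) {
            print "image: ", $tag->[1]{src} || '(no src)', "\n";
        }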

    Update: After thinking about this on a long drive today (and checking the W3C website to make sure), I realized that all you're going to be able to parse out of the HTML are the images' allotted pixel dimensions (the width and height attributes), not their byte sizes. I believe other monks in the thread have pointed out ways to get the actual image sizes.

    There is no emoticon for what I'm feeling now.
Re: determine web page size, w/ and w/o images
by data64 (Chaplain) on Jun 15, 2004 at 04:42 UTC

    Using wget might be much easier in this situation. It has a bunch of switches to download all the graphics related to a page. curl and HTTrack can also do the same thing, but I am not familiar with them.
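    A rough sketch of a Perl wrapper around wget's page-requisites mode, assuming wget is on the path (note that -p can exit nonzero even when most of the page downloaded fine):

        use strict;
        use warnings;
        use File::Temp qw(tempdir);

        my $url = shift or die "Usage: $0 URL\n";
        my $dir = tempdir( CLEANUP => 1 );

        # -p (--page-requisites) grabs the images, CSS, etc. needed to
        # render the page; -nd flattens the directory tree; -P sets the
        # download directory; -q keeps it quiet.
        system( 'wget', '-q', '-p', '-nd', '-P', $dir, $url );
        warn "wget exited nonzero; totals may be incomplete\n" if $?;

        # Sum the sizes of everything wget pulled down.
        my $total = 0;
        opendir my $dh, $dir or die "Can't read $dir: $!\n";
        for my $file ( readdir $dh ) {
            next unless -f "$dir/$file";
            $total += -s "$dir/$file";
        }
        closedir $dh;

        print "Total download size: $total bytes\n";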


    Just a tongue-tied, twisted, earth-bound misfit. -- Pink Floyd

•Re: determine web page size, w/ and w/o images
by merlyn (Sage) on Jun 15, 2004 at 15:15 UTC
Re: determine web page size, w/ and w/o images
by brian_d_foy (Abbot) on Jun 18, 2004 at 06:16 UTC
    I wrote HTTP::Size to do just this. From the docs:
    get_sizes( URL, BASE_URL )

    The get_sizes function is like get_size, although for HTML pages it also fetches all of the images and sums the sizes of the original page and the images. It returns a total download size. In list context it returns the total download size and a hash reference whose keys are the URLs of the images found in the HTML and whose values are hash references describing each image (see the docs for their keys).
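    A minimal usage sketch (assuming the functions are called fully qualified, as in the module's synopsis):

        use strict;
        use warnings;
        use HTTP::Size;

        my $url = shift or die "Usage: $0 URL\n";

        # Size of the HTML source alone.
        my $source = HTTP::Size::get_size( $url );

        # Total including images; in list context we also get back the
        # per-image hash reference described above.
        my ( $total, $images ) = HTTP::Size::get_sizes( $url );

        print "Source: $source bytes\n";
        print "Total:  $total bytes\n";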
    --
    brian d foy <bdfoy@cpan.org>
      Great! Thanks!

      I couldn't tell from the docs -- does the module handle redirects? And does it include external CSS and JavaScript files, etc.?

      Curious, thanks!