darshan_atha has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, can you please help me find the size of an HTML page? I want the total size of the page as well as the total size of its images, the size of the HTML itself, and so on. Is there a Perl module that helps with this? Please show me the shortest and cleanest way. Thanks.

(jeffa) Re: size of page
by jeffa (Bishop) on Oct 11, 2002 at 05:08 UTC
    With HTML::TokeParser and LWP::Simple:
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TokeParser;
    use URI;

    my $url  = shift || 'http://www.google.com';
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;

    my $size = length($html);
    print "html has size $size\n";

    my $parser = HTML::TokeParser->new(\$html);
    while (my $tag = $parser->get_tag('img')) {
        my $img = $tag->[1]->{src};
        next unless $img;
        # resolve relative src values against the page URL
        $img = URI->new_abs($img, $url)->as_string;
        # head() returns (type, length, mtime, expires, server)
        my $img_size = (head($img))[1];
        next unless $img_size;
        print "$img has size $img_size\n";
        $size += $img_size;
    }
    print "Total size: $size\n";
    It's not perfect ... but it should get you started. ;)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Specialization in action (and I'm pretty sure this is a complete solution).

      update: made a tiny think-o, now fixed ;)

      use strict;
      use LWP::Simple;
      use HTML::LinkExtractor;

      my $url   = shift || 'http://www.google.com';
      my $html  = get($url);
      my $Total = length $html;
      print "initial size $Total\n";

      my $LX = new HTML::LinkExtractor(
          sub {
              my( $X, $tag ) = @_;
              unless( grep { $_ eq $tag->{tag} } @HTML::LinkExtractor::TAGS_IN_NEED ) {
                  print "$$tag{tag}\n";
                  for my $urlAttr ( @{ $HTML::LinkExtractor::TAGS{$$tag{tag}} } ) {
                      if( exists $$tag{$urlAttr} ) {
                          my $size = (head( $$tag{$urlAttr} ))[1];
                          $Total += $size if $size;
                          print "adding $size\n" if $size;
                      }
                  }
              }
          },
          $url,
          0
      );
      $LX->parse(\$html);
      print "The total size of \n$url\n is $Total bytes\n";
      __END__
      use Data::Dumper;
      use HTML::LinkExtractor;
      print Dumper \@HTML::LinkExtractor::VALID_URL_ATTRIBUTES;
      print Dumper \%HTML::LinkExtractor::TAGS;
      print Dumper \@HTML::LinkExtractor::TAGS_IN_NEED;
      The strategy: if the link-type tag is NOT one of TAGS_IN_NEED (tags like <a href>, which we don't count) and it has a valid URL attribute, then we do a head() and add the size.

      Here is an example run through:

      
      E:\dev\LOOSE>perl sizeapage.pl http://crazyinsomniac.perlmonk.org
      initial size 2018
      img
      adding 24696
      img
      adding 43
      Total size of
      http://crazyinsomniac.perlmonk.org
      is 26757 bytes
      
      E:\dev\LOOSE>
      
      CAVEAT:
      This script doesn't currently handle DTDs (stuff like)
      <!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd"> due to a bug in HTML::LinkExtractor (which I'm fixing at the moment).

      When it's available from cpan, download the latest version of HTML::LinkExtractor and this caveat goes away :)

      Please be aware that if a page uses Java, JavaScript, Flash or other dynamic technologies which in turn download stuff not referenced directly on the page, there is no way to figure that out without parsing the Java/JavaScript/Flash itself, which isn't very practical.

      update: Wed Oct 16 09:05:47 2002 GMT ~ after revisiting this old node of mine, I realized that my little snippet doesn't follow frame/layer tags, so I'm rewriting this, and I'll post it in the Code Catacombs eventually. It may be re-inventing the wheel, but it'll be more to my liking ;) (and more complete)

      ____________________________________________________
      ** The Third rule of perl club is a statement of fact: pod is sexy.

      Thanks for the help, it worked. But an iframe, which internally loads other HTML, can cause a problem. Can you help with this? Thanks.
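
      One possible way to cover frames and iframes is to treat each frame/iframe src as another page and recurse into it. The following is only a minimal sketch under stated assumptions: page_weight and the $fetch callback are hypothetical names (not part of LWP or HTML::TokeParser), only img, frame and iframe tags are considered, and length() measures whatever the fetcher returns, which may differ from the bytes sent over the wire.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hypothetical helper: size of a page plus its images, following
# frame/iframe documents recursively.
#   $url   - absolute URL of the page
#   $fetch - code ref: takes a URL, returns the body or undef (assumption)
#   $seen  - internal hash ref so each resource is counted only once
sub page_weight {
    my ($url, $fetch, $seen) = @_;
    $seen ||= {};
    return 0 if $seen->{$url}++;

    my $body = $fetch->($url);
    return 0 unless defined $body;
    my $total = length $body;

    my $parser = HTML::TokeParser->new(\$body);
    while (my $tag = $parser->get_tag('img', 'frame', 'iframe')) {
        my $src = $tag->[1]{src} or next;
        # resolve relative links against the document that contains them
        my $abs = URI->new_abs($src, $url)->as_string;

        if ($tag->[0] eq 'img') {
            next if $seen->{$abs}++;
            my $img = $fetch->($abs);
            $total += length $img if defined $img;
        }
        else {
            # frame or iframe: an HTML document with its own embedded links
            $total += page_weight($abs, $fetch, $seen);
        }
    }
    return $total;
}
```

      With LWP::Simple loaded, something like page_weight($url, sub { get($_[0]) }) should work as a starting point. Downloading every image is wasteful compared with head(), so a real script would probably switch the image branch back to a HEAD request.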
Re: size of page
by adrianh (Chancellor) on Oct 11, 2002 at 14:00 UTC

    The new HTTP::Size module seems to do exactly what you want.

Re: size of page
by ajt (Prior) on Oct 11, 2002 at 12:49 UTC

    Our very own merlyn wrote a column all about calculating the size, and hence the download time, of a given web page. You can read all about it here; however, be aware that it is not complete, as it does not add the weight of a style sheet, nor can it take into account the weight of an image loaded via a style sheet or an ECMAScript call.

    I tweaked the code to include linked style sheets, but it still ignores "@import"ed style sheets and any image loaded via a style sheet or ECMAScript, as CSS and ECMAScript aren't parsed by any of the HTML parsers...
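
    For the CSS half of that gap, a rough approximation is to fetch each style sheet and pull out the URLs it references. The sketch below is an assumption-laden illustration, not a CSS parser: css_urls is a hypothetical helper, it uses a regex that only understands url(...) tokens and string-form @import rules, and it will miss or mangle unusual cases such as escaped characters or data: URIs.

```perl
use strict;
use warnings;

# Hypothetical helper: return the URLs a style sheet references, in order
# of first appearance and without duplicates. Handles url(foo), url("foo"),
# url('foo') and the string form of an import rule.
sub css_urls {
    my ($css) = @_;

    # drop comments so commented-out rules are not counted
    $css =~ s{/\*.*?\*/}{}gs;

    my (%seen, @urls);
    while ($css =~ m{
            url\( \s* (["']?) ([^"')]+?) \1 \s* \)
          | \@import \s+ (["']) ([^"']+) \3
        }gix) {
        # the first alternative fills capture 2, the second fills capture 4
        my $u = defined $2 ? $2 : $4;
        push @urls, $u unless $seen{$u}++;
    }
    return @urls;
}
```

    Each returned URL still has to be resolved against the style sheet's own URL (for example with URI->new_abs) before its size is added, and imported style sheets need the same treatment in turn.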

    Update: NONE of the tools that parse HTML alone will give correct total-page sizes as they all miss scripting (ECMA, JavaScript etc. etc.) and CSS linked assets.


    --
    ajt