Specialization in action (and I'm pretty sure this is a complete solution).

update: made a tiny think-o, now fixed ;)

use strict;
use LWP::Simple;
use HTML::LinkExtractor;

my $url   = shift || 'http://www.google.com';
my $html  = get($url);
my $Total = length $html;
print "initial size $Total\n";

my $LX = new HTML::LinkExtractor(
    sub {
        my( $X, $tag ) = @_;

        unless( grep { $_ eq $tag->{tag} } @HTML::LinkExtractor::TAGS_IN_NEED ) {
            print "$$tag{tag}\n";

            for my $urlAttr ( @{ $HTML::LinkExtractor::TAGS{ $$tag{tag} } } ) {
                if( exists $$tag{$urlAttr} ) {
                    my $size = (head( $$tag{$urlAttr} ))[1];
                    $Total += $size if $size;
                    print "adding $size\n" if $size;
                }
            }
        }
    },
    $url,
    0
);

$LX->parse(\$html);

print "The total size of \n$url\n is $Total bytes\n";

__END__
use Data::Dumper;
use HTML::LinkExtractor;
print Dumper \@HTML::LinkExtractor::VALID_URL_ATTRIBUTES;
print Dumper \%HTML::LinkExtractor::TAGS;
print Dumper \@HTML::LinkExtractor::TAGS_IN_NEED;

The strategy: if a link-type tag is NOT one of TAGS_IN_NEED (tags like <a href>, which we don't count) and it has a valid URL attribute, then we do a head() on that URL and add the reported size to the total.
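
The filtering decision can be sketched on its own, without touching the network. This is only an illustration, not HTML::LinkExtractor's code: the two tables are small hypothetical stand-ins for @HTML::LinkExtractor::TAGS_IN_NEED and %HTML::LinkExtractor::TAGS (the real contents may differ, dump them with Data::Dumper to see), and urls_to_size() is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the module's tables; the real contents may differ.
my @tags_in_need = qw( a );          # link tags we do NOT count
my %url_attrs    = (                 # tag => attributes that may hold a URL
    img    => [qw( src lowsrc )],
    script => [qw( src )],
    a      => [qw( href )],
);

# Return the URLs of one tag that should be sized with head(),
# or an empty list if the tag is skipped or unknown.
sub urls_to_size {
    my ($tag) = @_;
    return if grep { $_ eq $tag->{tag} } @tags_in_need;
    my $attrs = $url_attrs{ $tag->{tag} } || [];
    return map { $tag->{$_} } grep { exists $tag->{$_} } @$attrs;
}

# An <img> is counted, an <a href> is not.
print "img: ", scalar( () = urls_to_size( { tag => 'img', src => 'http://example.com/x.png' } ) ), "\n";
print "a:   ", scalar( () = urls_to_size( { tag => 'a', href => 'http://example.com/' } ) ), "\n";
```

Each URL that comes back is what the script would hand to head().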

Here is an example run:


E:\dev\LOOSE>perl sizeapage.pl http://crazyinsomniac.perlmonk.org
initial size 2018
img
adding 24696
img
adding 43
Total size of
http://crazyinsomniac.perlmonk.org
is 26757 bytes

E:\dev\LOOSE>
CAVEAT:
This script doesn't currently handle DTDs (stuff like
<!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd">) due to a bug in HTML::LinkExtractor (which I'm fixing at the moment).

When it's available from CPAN, download the latest version of HTML::LinkExtractor and this caveat goes away :)
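
Until then, one possible stopgap is to drop the declaration before handing the HTML to the parser. This is a sketch under an assumption: that the trouble is triggered by the <!DOCTYPE ...> declaration itself (I haven't spelled out how the bug shows up, so treat it as a guess). strip_doctype() is a made-up helper name.

```perl
use strict;
use warnings;

# Remove <!DOCTYPE ...> declarations from an HTML string.
# Crude: assumes the declaration contains no '>' of its own
# (i.e. no internal subset).
sub strip_doctype {
    my ($html) = @_;
    $html =~ s/<!DOCTYPE\b[^>]*>//gi;
    return $html;
}

my $page = qq{<!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd">\n}
         . qq{<html><img src="a.png"></html>};
print strip_doctype($page), "\n";
```

In the script, the page length would still be taken from the original $html, and only the stripped copy passed to parse(), so the initial size stays honest.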

Please be aware that if a page uses Java/JavaScript/Flash or other dynamic technologies which in turn download stuff not referenced directly on the page, there is no way for this script to figure that out without parsing the Java/JavaScript/Flash... which isn't very practical.

update: Wed Oct 16 09:05:47 2002 GMT ~ After revisiting this old node of mine, I realized that my little snippet doesn't follow frame/layer tags, so I'm rewriting this, and I'll post it in the Code Catacombs eventually. It may be re-inventing a wheel, but it'll be more to my liking ;) (and more complete)
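
For the curious, here is roughly what "following frames" could look like. This is NOT the rewrite mentioned above, just a minimal sketch under loud assumptions: framed_size() is a made-up name, the fetcher is passed in as a code ref so the example runs without a network, frame URLs are assumed to be absolute already, only frame/iframe are handled (not layer), and the src attribute is found with a crude regex where a real version would use an HTML parser.

```perl
use strict;
use warnings;

# Size of a page plus every page it pulls in via <frame>/<iframe> src.
# $fetch is a code ref taking a URL and returning the HTML (or undef).
sub framed_size {
    my ( $url, $fetch, $seen ) = @_;
    $seen ||= {};
    return 0 if $seen->{$url}++;        # guard against frame loops

    my $html = $fetch->($url);
    return 0 unless defined $html;

    my $total = length $html;
    while ( $html =~ /<i?frame\b[^>]*?\bsrc\s*=\s*["']?([^"'\s>]+)/gi ) {
        my $src = $1;                   # copy before recursing
        $total += framed_size( $src, $fetch, $seen );
    }
    return $total;
}

# Fake site standing in for real HTTP fetches.
my %pages = (
    'http://example.com/'     => '<frameset><frame src="http://example.com/menu">'
                               . '<frame src="http://example.com/body"></frameset>',
    'http://example.com/menu' => '<p>menu</p>',
    'http://example.com/body' => '<p>body</p>',
);

my $size = framed_size( 'http://example.com/', sub { $pages{ $_[0] } } );
print "total $size bytes\n";
```

With LWP::Simple loaded, the fetcher could simply be sub { get( $_[0] ) }. Images and other resources inside each frame would still need the head() treatment from the original snippet.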

____________________________________________________
** The Third rule of perl club is a statement of fact: pod is sexy.


In reply to Re: (jeffa) Re: size of page by PodMaster
in thread size of page by darshan_atha
