Specialization in action (and I'm pretty sure this is a complete solution).

update: made a tiny think-o, now fixed ;)

use strict;
use LWP::Simple;
use HTML::LinkExtractor;

my $url  = shift || 'http://www.google.com';
my $html = get($url);
die "Could not fetch $url\n" unless defined $html;

my $Total = length $html;
print "initial size $Total\n";

my $LX = HTML::LinkExtractor->new(
    sub {
        my( $X, $tag ) = @_;

        # skip tags like <a href>, which only point at other pages
        unless( grep { $_ eq $tag->{tag} } @HTML::LinkExtractor::TAGS_IN_NEED ) {
            print "$$tag{tag}\n";

            # every attribute of this tag that may hold a URL
            for my $urlAttr ( @{ $HTML::LinkExtractor::TAGS{ $$tag{tag} } } ) {
                if( exists $$tag{$urlAttr} ) {
                    # head() in list context: element 1 is the document length
                    my $size = ( head( $$tag{$urlAttr} ) )[1];
                    $Total += $size if $size;
                    print "adding $size\n" if $size;
                }
            }
        }
    },
    $url,
    0
);

$LX->parse(\$html);

print "The total size of \n$url\n is $Total bytes\n";

__END__
# scratch code for inspecting the tables the script relies on
use Data::Dumper;
use HTML::LinkExtractor;
print Dumper \@HTML::LinkExtractor::VALID_URL_ATTRIBUTES;
print Dumper \%HTML::LinkExtractor::TAGS;
print Dumper \@HTML::LinkExtractor::TAGS_IN_NEED;

The strategy: if our link-type tag is NOT one of TAGS_IN_NEED (tags like <a href>, which we don't count) and it has a valid URL attribute, then we do a head() on that URL and add the reported size to the total.

Here is an example run through:


E:\dev\LOOSE>perl sizeapage.pl http://crazyinsomniac.perlmonk.org
initial size 2018
img
adding 24696
img
adding 43
Total size of
http://crazyinsomniac.perlmonk.org
is 26757 bytes

E:\dev\LOOSE>
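
The arithmetic in that run can be checked offline. What follows is only a sketch of the same filter-and-sum logic, not the script itself: @tags_in_need, %url_attrs and %fake_length are made-up stand-ins (for the module's tables and for head() respectively), and the example.invalid URLs are placeholders. It also shows that a resource whose server sends no Content-Length is silently skipped, since head() then has no size to report.

```perl
use strict;
use warnings;

# Hypothetical stand-ins; the real script reads these from HTML::LinkExtractor.
my @tags_in_need = ('a');    # link tags we do NOT count
my %url_attrs    = ( img => ['src'], script => ['src'], a => ['href'] );

# Hypothetical replacement for head(): URL => Content-Length
# (undef means the server sent no Content-Length header).
my %fake_length = (
    'http://example.invalid/big.gif'  => 24696,
    'http://example.invalid/dot.gif'  => 43,
    'http://example.invalid/nolen.js' => undef,
);

# Same decision as the callback: skip TAGS_IN_NEED, then add the size
# of every URL attribute for which a size is known.
sub total_size {
    my ( $initial, @tags ) = @_;
    my $total = $initial;
    for my $tag (@tags) {
        next if grep { $_ eq $tag->{tag} } @tags_in_need;
        for my $attr ( @{ $url_attrs{ $tag->{tag} } || [] } ) {
            next unless exists $tag->{$attr};
            my $size = $fake_length{ $tag->{$attr} };
            $total += $size if $size;
        }
    }
    return $total;
}

my $total = total_size(
    2018,    # size of the HTML itself, as in the run above
    { tag => 'img',    src  => 'http://example.invalid/big.gif' },
    { tag => 'img',    src  => 'http://example.invalid/dot.gif' },
    { tag => 'a',      href => 'http://example.invalid/elsewhere.html' },
    { tag => 'script', src  => 'http://example.invalid/nolen.js' },
);
print "total $total bytes\n";    # prints "total 26757 bytes"
```

2018 + 24696 + 43 = 26757, which matches the transcript; the <a> link and the script without a known length contribute nothing.
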
CAVEAT:
This script doesn't currently handle DTDs (declarations such as
<!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd">) due to a bug in HTML::LinkExtractor (which I'm fixing at the moment).

When it's available from CPAN, download the latest version of HTML::LinkExtractor and this caveat goes away :)

Please be aware that if a page uses Java, JavaScript, Flash, or other dynamic technologies which in turn download stuff not referenced directly on the page, there is no way for the script to figure that out without parsing the Java/JavaScript/Flash itself... which isn't very practical.

update: Wed Oct 16 09:05:47 2002 GMT ~ after revisiting this old node of mine, I realized that my little snippet doesn't follow frame/layer tags, so I'm rewriting this, and I'll post it in the Code Catacombs eventually. It may be re-inventing a wheel, but it'll be more to my liking ;) (and more complete)

____________________________________________________
** The Third rule of perl club is a statement of fact: pod is sexy.