use strict;
use LWP::Simple;
use HTML::LinkExtractor;
my $url = shift || 'http://www.google.com';
my $html = get($url);
die "Could not fetch $url\n" unless defined $html;   # get() returns undef on failure
my $Total = length $html;
print "initial size $Total\n";
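my $LX = HTML::LinkExtractor->new(
    sub {
        my( $X, $tag ) = @_;
        # Skip tags listed in TAGS_IN_NEED (tags like <a href>, which we don't count).
        unless( grep { $_ eq $tag->{tag} } @HTML::LinkExtractor::TAGS_IN_NEED ) {
            print "$$tag{tag}\n";
            # Look at every attribute of this tag that can carry a URL.
            for my $urlAttr ( @{ $HTML::LinkExtractor::TAGS{ $$tag{tag} } } ) {
                if( exists $$tag{$urlAttr} ) {
                    # head() in list context: the second element is the document length.
                    my $size = ( head( $$tag{$urlAttr} ) )[1];
                    $Total += $size if $size;
                    print "adding $size\n" if $size;
                }
            }
        }
    },
    $url,    # base URL, used to make relative links absolute
    0
);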
$LX->parse(\$html);
print "The total size of \n$url\n is $Total bytes\n";
__END__
use Data::Dumper;
use HTML::LinkExtractor;
print Dumper \@HTML::LinkExtractor::VALID_URL_ATTRIBUTES;
print Dumper \%HTML::LinkExtractor::TAGS;
print Dumper \@HTML::LinkExtractor::TAGS_IN_NEED;
The strategy: if the tag is NOT one of TAGS_IN_NEED (tags like <a href>, which we don't count) and it has a valid URL-bearing attribute, then we do a head() on that URL and add the reported size to the total.
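The decision described above can be sketched without touching the network. This is only an illustration: %url_attrs and @ignored_tags are hypothetical stand-ins for %HTML::LinkExtractor::TAGS and @HTML::LinkExtractor::TAGS_IN_NEED (the real contents are whatever the Dumper snippet prints), and countable_urls is a name made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the module's tables; the real ones may differ.
my %url_attrs    = ( img => ['src'], script => ['src'], a => ['href'] );
my @ignored_tags = ('a');    # tags like <a href>, which we don't count

# Return the URLs in one parsed tag that would be handed to head().
sub countable_urls {
    my ($tag) = @_;
    my $name = $tag->{tag};
    return () if grep { $_ eq $name } @ignored_tags;
    return map  { $tag->{$_} }
           grep { exists $tag->{$_} }
           @{ $url_attrs{$name} || [] };
}

my @urls = map { countable_urls($_) } (
    { tag => 'img', src  => 'http://example.com/logo.png' },
    { tag => 'a',   href => 'http://example.com/next.html' },
    { tag => 'img', alt  => 'no src attribute here' },
);
print "$_\n" for @urls;    # prints http://example.com/logo.png
```

Only the first tag survives: the <a> is filtered out by tag name, and the second <img> has no URL attribute to look up.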
Here is an example run through:
E:\dev\LOOSE>perl sizeapage.pl http://crazyinsomniac.perlmonk.org
initial size 2018
img
adding 24696
img
adding 43
Total size of
http://crazyinsomniac.perlmonk.org
is 26757 bytes
E:\dev\LOOSE>
CAVEAT:
This script doesn't currently handle DTDs (stuff like)
<!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd">
due to a bug in HTML::LinkExtractor (which I'm fixing at the moment).
When it's available from CPAN, download the latest version of HTML::LinkExtractor and this caveat goes away :)
Please be aware that if a page uses Java/JavaScript/Flash or other dynamic technologies which in turn download stuff not referenced directly on the page, there is no way for this script to figure that out without parsing the Java/JavaScript/Flash... which isn't very practical.
update: Wed Oct 16 09:05:47 2002 GMT ~ after revisiting this old node of mine, I realized that my little snippet doesn't follow frame/layer tags, so I'm rewriting this, and I'll post it in the Code Catacombs eventually. It may be re-inventing a wheel, but it'll be more to my liking ;) (and more complete)
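For what it's worth, one way a frame-following version might work is to recurse into each frame's document and keep a "seen" hash so nothing is counted twice. The sketch below is not the rewrite mentioned above, just a guess at its shape: %site is a made-up in-memory stand-in for fetching and parsing pages, and total_size is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical in-memory "site": URL => [ size in bytes, [ frame URLs ] ].
my %site = (
    'http://example.com/'          => [ 300, [ 'http://example.com/menu.html',
                                               'http://example.com/body.html' ] ],
    'http://example.com/menu.html' => [ 120, [] ],
    'http://example.com/body.html' => [ 900, [ 'http://example.com/menu.html' ] ],
);

# Size of a page plus every frame it pulls in, counting each URL only once.
sub total_size {
    my ( $url, $seen ) = @_;
    $seen ||= {};
    return 0 if $seen->{$url}++;           # already counted, or a frame loop
    my $page = $site{$url} or return 0;    # unknown URL: nothing to add
    my ( $size, $frames ) = @$page;
    $size += total_size( $_, $seen ) for @$frames;
    return $size;
}

print total_size('http://example.com/'), "\n";    # prints 1320
```

menu.html is referenced twice but only added once (300 + 120 + 900), which is also what keeps mutually referencing frames from recursing forever.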