in reply to size of page

With HTML::TokeParser and LWP::Simple:
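    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TokeParser;
    use URI;

    my $url  = shift || 'http://www.google.com';
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;

    my $size = length($html);
    print "html has size $size\n";

    my $parser = HTML::TokeParser->new(\$html);
    while (my $tag = $parser->get_tag('img')) {
        my $img = $tag->[1]->{src};
        next unless $img;

        # resolve relative src values against the page URL
        $img = URI->new_abs($img, $url)->as_string;

        # head() in list context returns (type, length, ...);
        # the length is undef when the server does not report one
        my $img_size = (head($img))[1];
        next unless defined $img_size;

        print "$img has size $img_size\n";
        $size += $img_size;
    }
    print "Total size: $size\n";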
It's not perfect ... but it should get you started. ;)

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Re: (jeffa) Re: size of page
by PodMaster (Abbot) on Oct 11, 2002 at 09:25 UTC
    Specialization in action (and I'm pretty sure this is a complete solution).

    update: made a tiny think-o, now fixed ;)

    use strict;
    use LWP::Simple;
    use HTML::LinkExtractor;

    my $url   = shift || 'http://www.google.com';
    my $html  = get($url);
    my $Total = length $html;
    print "initial size $Total\n";

    my $LX = new HTML::LinkExtractor(
        sub {
            my( $X, $tag ) = @_;
            # skip tags like <a href>, which we don't count
            unless( grep { $_ eq $tag->{tag} } @HTML::LinkExtractor::TAGS_IN_NEED ) {
                print "$$tag{tag}\n";
                for my $urlAttr ( @{ $HTML::LinkExtractor::TAGS{ $$tag{tag} } } ) {
                    if( exists $$tag{$urlAttr} ) {
                        my $size = (head( $$tag{$urlAttr} ))[1];
                        $Total += $size if $size;
                        print "adding $size\n" if $size;
                    }
                }
            }
        },
        $url,
        0
    );
    $LX->parse(\$html);
    print "The total size of \n$url\n is $Total bytes\n";

    __END__
    use Data::Dumper;
    use HTML::LinkExtractor;
    print Dumper \@HTML::LinkExtractor::VALID_URL_ATTRIBUTES;
    print Dumper \%HTML::LinkExtractor::TAGS;
    print Dumper \@HTML::LinkExtractor::TAGS_IN_NEED;

    The strategy: if the link-type tag is NOT one of TAGS_IN_NEED (tags like <a href>, which we don't count) and it has a valid URL attribute, then we do a head() on that URL and add the reported size to the total.

    Here is an example run through:

    
    E:\dev\LOOSE>perl sizeapage.pl http://crazyinsomniac.perlmonk.org
    initial size 2018
    img
    adding 24696
    img
    adding 43
    Total size of
    http://crazyinsomniac.perlmonk.org
    is 26757 bytes
    
    E:\dev\LOOSE>
    
    CAVEAT:
    This script doesn't currently handle DTDs (stuff like
    <!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd">) due to a bug in HTML::LinkExtractor (which I'm fixing at the moment).

    When it's available from CPAN, download the latest version of HTML::LinkExtractor and this caveat goes away :)

    Please be aware that if a page uses java/javascript/flash or other dynamic technologies which in turn download stuff not referenced directly on the page, there is no way for the script to figure that out without parsing java/javascript/flash... which isn't very practical.

    update: Wed Oct 16 09:05:47 2002 GMT ~ after revisiting this old node of mine, I realized that my little snippet doesn't follow frame/layer tags, so I'm rewriting this, and I'll post it in the Code Catacombs eventually. It may be re-inventing the wheel, but it'll be more to my liking ;) (and more complete)
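
In the meantime, here is a minimal sketch of how the frame/iframe gap might be approached. It is not the rewrite mentioned above. It uses HTML::TokeParser and URI to pull out the documents named by frame and iframe tags, so each could be fetched and sized the same way as the top-level page. The helper name frame_sources and the example.com URLs are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hypothetical helper (not part of the snippets in this thread):
# return the absolute URL of every frame/iframe src found in $html,
# resolving relative values against $base.
sub frame_sources {
    my ($html, $base) = @_;
    my $parser = HTML::TokeParser->new(\$html) or return;
    my @urls;
    while (my $tag = $parser->get_tag('frame', 'iframe')) {
        my $src = $tag->[1]{src};
        next unless defined $src && length $src;
        push @urls, URI->new_abs($src, $base)->as_string;
    }
    return @urls;
}

# Made-up sample page, parsed from a string so no network access is needed.
my $html = '<html><frameset><frame src="menu.html"></frameset>'
         . '<body><iframe src="/ads/banner.html"></iframe></body></html>';

print "$_\n" for frame_sources($html, 'http://www.example.com/dir/index.html');
```

Each URL returned could then be passed through get() and the same size-counting loop, ideally with a hash of already-visited URLs so that frames pointing at each other don't recurse forever.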

    ____________________________________________________
    ** The Third rule of perl club is a statement of fact: pod is sexy.

Re: (jeffa) Re: size of page
by darshan_atha (Acolyte) on Oct 11, 2002 at 08:22 UTC
    Thanks for the help, it worked. But an iframe, which internally loads other HTML, can cause a problem: that content isn't counted. Can you help with this? Thanks.