Hello wise ones,

I have been experimenting with the LWP::* modules, aiming to build a tool that measures the download time of an arbitrary web URL (www.domain.org | www.domain.org/page.cgi | www.domain.org/path/to/page.cgi). Getting the body of the page was pretty simple, but I soon realized that it was only a skeleton without all the included content (images and so on).

Then I had the idea of separating the content served from the base URL from the content served by other sites.
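A minimal sketch of that internal/external split, assuming the URI module (installed alongside LWP) is used to resolve each link against the page URL before comparing hosts. The helper name `classify_link` and the sample URLs are hypothetical and do not come from my script; this only illustrates the idea and is not a tested solution.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve $link against the page URL $base and
# report whether it is served by the same host as the page itself.
sub classify_link {
    my ($base, $link) = @_;

    # new_abs copes with absolute paths (/x), relative paths (x/y, ../x)
    # and links that are already absolute.
    my $abs = URI->new_abs($link, $base);

    # Anything that is not http/https (mailto:, javascript:, data:) is
    # neither internal nor external content to download.
    return ('other', $abs->as_string)
        unless ($abs->scheme // '') =~ /^https?$/;

    my $same_host = lc($abs->host // '') eq lc(URI->new($base)->host // '');
    return ($same_host ? 'internal' : 'external', $abs->as_string);
}

# Sample links are made up for illustration.
for my $link ('/img/logo.png', 'pics/a.gif', 'http://cdn.example.net/b.js') {
    my ($kind, $abs) = classify_link('http://www.domain.org/path/to/page.cgi', $link);
    print "$kind\t$abs\n";
}
```

Comparing hosts rather than matching the page URL as a string prefix means a page such as www.domain.org/path/to/page.cgi still recognizes /img/logo.png as its own content.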

I ended up with the test code at the end of this post, but I'm not at all sure that it covers all the embedding/linking methods in use across the web.

I'm not even sure that the link types I parse in the example code (body img src) are exhaustive.
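One way to widen the coverage might be HTML::LinkExtor, which reports every link-bearing attribute listed in HTML::Tagset (img src, script src, link href, body background, frame src and so on) rather than a hand-picked set of tags. The sketch below is only an assumption about how that could look: the helper name `embedded_links` and the sample HTML are hypothetical, and it still misses anything referenced from CSS `url(...)` or loaded by JavaScript, since only markup attributes are parsed.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: return [tag, attribute, absolute URL] for every
# link-bearing attribute found in $html, skipping plain <a> hyperlinks.
sub embedded_links {
    my ($html, $base) = @_;

    # No callback, so links are collected internally; passing $base
    # makes the extractor return absolute URLs.
    my $extor = HTML::LinkExtor->new(undef, $base);
    $extor->parse($html);
    $extor->eof;

    my @found;
    for my $entry ($extor->links) {
        my ($tag, %attr) = @$entry;
        next if $tag eq 'a';    # navigation, not fetched when the page loads
        push @found, [$tag, $_, "$attr{$_}"] for sort keys %attr;
    }
    return @found;
}

# Sample markup is made up for illustration.
my $html = '<body background="bg.png"><img src="/i/a.gif">'
         . '<script src="http://cdn.example.net/x.js"></script>'
         . '<a href="next.html">next</a></body>';
print join("\t", @$_), "\n"
    for embedded_links($html, 'http://www.domain.org/path/page.cgi');
```

Skipping `a` is a design choice for a download timer: hyperlinks lead to other pages, while the remaining tags mostly name resources a browser fetches to render this one.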

Excuse me for such a general question. Trusting in your patience, I look forward to any hints.

Lor*
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::Parse;
use Data::Dumper;
$|++;

foreach my $url (@ARGV) {
    my $totsize = 0;
    my (@intlink, @extlink, @brokenlink);
    print "PROCESSING:\t$url\n";
    $url = 'http://' . $url;
    $url =~ s/\s+//g;    # delete spaces
    $url =~ s/\/$//;     # remove a trailing / if present
    my $ua = LWP::UserAgent->new;
    $ua->agent("libwww-perl/5.10.1");
    my $response = $ua->get($url);
    my $body = $response->content;
    print "body size:\t", length($body), "\n";
    $totsize += length($body);
    my $parsed_html = parse_html($body);
    for (@{ $parsed_html->extract_links(qw(body img src)) }) {
        #print "@$_\n"; next;
        my ($link) = @$_;
        # internal included content (\Q..\E keeps the dots in $url literal)
        if ($link =~ /^\// || $link =~ /^\Q$url\E/) {
            $link = $url . $link unless $link =~ /^\Q$url\E/;
            push @intlink, $link;
            #print "DEBUG a:->$link<-\n";
        }
        # external included content
        elsif ($link =~ /http:\/\//) {
            push @extlink, $link;
            #print "DEBUG b:->$link<-\n";
        }
        # ? included content (e.g. relative paths, stored unresolved)
        else {
            push @intlink, $link;
            #print "DEBUG c:->$link<-\n";
        }
    }
    print "-" x 34, "\n", "code\tbytes\tlink\n", "-" x 34, "\n";
    $totsize += (get_links($url, @intlink) || 0);
    $totsize += (get_links($url, @extlink) || 0);
    print "\n\nTOTSIZE: " . Arrotonda_Mega($totsize) . " ($totsize bytes)\n";
}

# fetch every link, print status code and size, return the total bytes
sub get_links {
    my $urlbase = shift;
    my @links = @_;
    my $totsize = 0;
    my $ua = LWP::UserAgent->new;
    $ua->agent("libwww-perl/5.10.1");
    my $request = HTTP::Request->new('GET');
    foreach my $url (@links) {
        next if $url =~ /^#/;    # skip in-page anchors
        $request->uri($url);
        my $response = $ua->request($request);
        print $response->code . "\t" . length($response->content) . "\t$url\n";
        $totsize += length($response->content);
    }
    return $totsize;
}
################################################################################
# scale a byte count to bytes/Kb/Mb/Gb
sub Arrotonda_Mega {
    my ($size, $n) = (shift, 0);
    return "0 bytes" unless defined $size;
    return "0 bytes" unless $size > 0;
    ++$n and $size /= 1024 until $size < 1024;
    return sprintf "%.4f %s", $size, (qw[ bytes Kb Mb Gb ])[$n];
}
################################################################################
there are no rules, there are no thumbs..

In reply to Recompose a webpage using LWP::UserAgent and HTML::Parse by Discipulus
