Hello wise ones,
I was experimenting with the LWP::* modules, aiming to build a tool that measures the download time of an arbitrary web URL (www.domain.org | www.domain.org/page.cgi | www.domain.org/path/to/page.cgi). It was pretty simple to get the body of the page, but I soon realized that this was only a skeleton, without all the included content (images and so on).
Then I had the idea of separating the content relative to the base URL from the content served by other sites.
I ended up with the test code below, but I'm not at all sure it covers all the embedding/linking methods used across the web.
I'm not even sure that the list of parsed tags (body img src) I used in the example code is exhaustive.
Excuse me for such a general question; trusting in your patience, I'm waiting for some hints.
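To make the question about exhaustiveness more concrete, here is a minimal sketch that lists the tag/attribute pairs treated as links by HTML::Tagset, the table that (as far as I understand) extract_links consults. The helper name link_attributes is my own and purely hypothetical; only %HTML::Tagset::linkElements comes from the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Tagset;

# %HTML::Tagset::linkElements maps a tag name to the list of
# attributes of that tag that may hold a URL.
# link_attributes (hypothetical helper) flattens that table into
# sorted "tag attribute" strings.
sub link_attributes {
    my @pairs;
    for my $tag (sort keys %HTML::Tagset::linkElements) {
        push @pairs, map { "$tag $_" } @{ $HTML::Tagset::linkElements{$tag} };
    }
    return @pairs;
}

print "$_\n" for link_attributes();
```

Not every pair in that list is embedded content (a href, for instance, is only a hyperlink), so the output would still need filtering by hand.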
Lor*
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::Parse;
use Data::Dumper;
$|++;
foreach my $url (@ARGV){
my $totsize = 0;
my (@intlink,@extlink,@brokenlink );
print"PROCESSING:\t$url\n";
$url =~ s/\s+//g; #delete spaces
$url = 'http://'.$url unless $url =~ m{^https?://}i; #add a scheme only if missing
$url =~s/\/$//; #removing an eventual / as last char
my $ua = LWP::UserAgent->new;
$ua->agent("libwww-perl/5.10.1");
my $response = $ua->get($url);
unless ($response->is_success) { print "FAILED:\t",$response->status_line,"\n"; next; }
my $body = $response->content;
print "body size:\t",length($body),"\n";
$totsize += length($body);
my $parsed_html = parse_html($body);
for (@{ $parsed_html->extract_links(qw(body img src)) }) { #print "@$_\n";next;
my ($link) = @$_;
# internal included content
if ($link =~ /^\// || $link =~ /^\Q$url\E/) {
$link= $url.$link unless $link =~ /^\Q$url\E/;
push @intlink, $link;
#print "DEBUG a:->$link<-\n";
}
# external included content
elsif ($link =~ m{^https?://}i) {
push @extlink, $link;
#print "DEBUG b:->$link<-\n";
}
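# anything else: fragments and other schemes are kept as-is, relative paths are resolved against the base url
else {
push @intlink, ($link =~ /^(?:#|[a-z][a-z0-9+.-]*:)/i ? $link : "$url/$link");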
#print "DEBUG c:->$link<-\n";
}
}
print "-" x34,"\n","code\tbytes\tlink\n","-" x34,"\n";
$totsize += (&get_links ($url, @intlink)||0);
$totsize += (&get_links ($url, @extlink)||0);
print "\n\nTOTSIZE: ".&Arrotonda_Mega($totsize)." ($totsize bytes)\n";
}
sub get_links {
my $urlbase = shift;
my @links = @_;
my $totsize;
my $ua = LWP::UserAgent->new;
$ua->agent("libwww-perl/5.10.1");
my $request = HTTP::Request->new('GET');
foreach my $url (@links) {
next if $url =~ /^#/;
$request->url($url);
my $response = $ua->request($request);
print $response->code."\t".length($response->content)."\t$url\n";
$totsize += length($response->content);
}
return $totsize;
}
################################################################################
sub Arrotonda_Mega
{
my( $size, $n ) =( shift, 0 );
return "0 bytes" unless defined $size;
return "0 bytes" unless $size > 0;
++$n and $size /= 1024 until $size < 1024;
return sprintf "%.4f %s",
$size, ( qw[ bytes Kb Mb Gb ] )[ $n ];
}
################################################################################
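The script above only adds up sizes; it does not time anything yet. Here is a minimal sketch of how each request could be wrapped in a timer, assuming the core Time::HiRes module. The helper name timed is hypothetical, and the summing loop is just a stand-in for a real download such as sub { $ua->get($url) }.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# timed (hypothetical helper): runs the given code reference and
# returns the elapsed wall-clock seconds followed by whatever the
# code returned.
sub timed {
    my ($code) = @_;
    my $t0     = [gettimeofday];
    my @result = $code->();
    return (tv_interval($t0), @result);
}

# Stand-in workload; a real use would pass sub { $ua->get($url) }.
my ($elapsed, $value) = timed(sub { my $n = 0; $n += $_ for 1 .. 1000; $n });
printf "%.6f s, result %d\n", $elapsed, $value;
```

Since the requests are made one after another, adding up the per-request times would give the sequential download time, which is probably an upper bound compared to a browser fetching images in parallel.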
there are no rules, there are no thumbs..