
Hello wise ones,

I was experimenting with the LWP::* modules, aiming to build a tool that times the download of an arbitrary web URL (www.domain.org | www.domain.org/page.cgi | www.domain.org/path/to/page.cgi). Getting the body of the page was pretty simple, but I soon realized that it was only a skeleton, without all the included content (images and so on).

Then I had the idea of separating the content relative to the base URL from the content served by other sites.

I ended up with the test code below, but I'm not at all sure that it covers all the embedding/linking methods used across the web.

I'm not even sure that the link types I parse in the example code (body img src) are exhaustive.
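To make the doubt concrete, here is a minimal sketch of a wider table of tag/attribute pairs that usually cause a browser to download extra content. The list is an assumption and certainly not exhaustive (it ignores CSS url(...) references and anything loaded by scripts), and the helper name is_embedded is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed, non-exhaustive map: tag => attributes that normally trigger
# an extra download when the page is rendered.
my %embed_attr = (
    img    => ['src'],
    script => ['src'],
    link   => ['href'],        # stylesheets and icons (other rel values too)
    iframe => ['src'],
    frame  => ['src'],
    embed  => ['src'],
    object => ['data'],
    input  => ['src'],         # <input type="image">
    body   => ['background'],
);

# Hypothetical helper: true if the tag/attribute pair is in the map above.
sub is_embedded {
    my ($tag, $attr) = map { lc } @_;
    return scalar grep { $_ eq $attr } @{ $embed_attr{$tag} || [] };
}

print is_embedded('img', 'src') ? "embedded\n" : "plain link\n";
print is_embedded('a',   'href') ? "embedded\n" : "plain link\n";
```

A table like this could be used to filter whatever a parser reports, so that ordinary hyperlinks (a href) are not counted in the download size.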

Excuse me for such a general question; trusting in your patience, I am waiting for some hints.

Lor*

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;
use HTML::Parse;
use Data::Dumper;
$|++;


foreach my $url (@ARGV){
 my $totsize = 0;
 my (@intlink,@extlink,@brokenlink );
 print"PROCESSING:\t$url\n";
 $url = 'http://'.$url;
 $url =~ s/\s+//g;     #delete spaces
 $url =~s/\/$//;       #removing an eventual / as last char
 my $ua = LWP::UserAgent->new;
 $ua->agent("libwww-perl/5.10.1");
 my $response = $ua->get($url);
 my $body =  $response->content;
 print "body size:\t",length($body),"\n";
 $totsize += length($body);
 
 my $parsed_html = parse_html($body);

 for (@{ $parsed_html->extract_links(qw(body img src)) }) { #print "@$_\n";next;
 
    my ($link) = @$_;
    # internal included content
    if ($link =~ /^\// || $link =~ /^\Q$url\E/) {    # \Q..\E: match the url literally
        $link= $url.$link unless $link =~ /^\Q$url\E/;
        push @intlink, $link;
        #print "DEBUG a:->$link<-\n";

    }
    # external included content
    elsif ($link =~ /^https?:\/\//) {
        push @extlink, $link;
        #print "DEBUG b:->$link<-\n";
    }
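    # anything else: treated as internal; a plain relative path is resolved
    # against the base url, fragments and links with a scheme are left as-is
    else {
        $link = "$url/$link" unless $link =~ /^(?:#|[a-z][a-z0-9+.-]*:)/i;
        push @intlink, $link;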
        #print "DEBUG c:->$link<-\n";
    }

  }
  print "-" x34,"\n","code\tbytes\tlink\n","-" x34,"\n";
  $totsize += (&get_links ($url, @intlink)||0);
  $totsize += (&get_links ($url, @extlink)||0);
  print "\n\nTOTSIZE: ".&Arrotonda_Mega($totsize)." ($totsize bytes)\n";
}



sub  get_links {
      my $urlbase = shift;
      my @links = @_;
      my $totsize;
      my $ua = LWP::UserAgent->new;
      $ua->agent("libwww-perl/5.10.1");
      my $request = HTTP::Request->new('GET');

      foreach my $url (@links) {
        next if $url =~ /^#/;    # skip fragment-only links
        $request->uri($url);     # uri() is the current name of the legacy url()
        my $response = $ua->request($request);
        print $response->code."\t".length($response->content)."\t$url\n";
        $totsize += length($response->content);
      }
   return  $totsize;
}
################################################################################
sub  Arrotonda_Mega
 {
   my( $size, $n ) =( shift, 0 );
   return "0 bytes" unless defined $size;
   return "0 bytes" unless  $size > 0;
   ++$n and $size /= 1024 until $size < 1024;
    return sprintf "%.4f %s",
          $size, ( qw[ bytes Kb Mb Gb ] )[ $n ];
 }
################################################################################
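The internal/external split described above could also be sketched with a single classification routine that compares host names instead of matching the whole base URL. This is only a sketch under stated assumptions: classify_link is a hypothetical helper, cdn.example.net is a made-up host, and only core Perl regexes are used, so edge cases such as ports, user info, or <base href> are ignored.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: classify a raw link against the base url.
# Returns 'internal', 'external' or 'skip'.
sub classify_link {
    my ($base, $link) = @_;

    # host part of the base url (empty string if it cannot be found)
    my ($base_host) = $base =~ m{^https?://([^/:]+)}i;
    $base_host = '' unless defined $base_host;

    return 'skip' if $link =~ /^#/;                           # fragment only
    return 'skip' if $link =~ /^(?:mailto|javascript|data):/i; # nothing to fetch

    # absolute (http://, https://) or protocol-relative (//host/...) link
    if ($link =~ m{^(?:https?:)?//([^/:]+)}i) {
        return lc($1) eq lc($base_host) ? 'internal' : 'external';
    }

    return 'internal';    # root-relative or plain relative path
}

my $base = 'http://www.domain.org';
for my $link ('/img/a.png', 'http://www.domain.org/b.css',
              '//cdn.example.net/c.js', '#top', 'pics/d.gif') {
    printf "%-10s %s\n", classify_link($base, $link), $link;
}
```

Comparing hosts rather than the full URL string keeps https and protocol-relative links to the same site in the internal list.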

there are no rules, there are no thumbs..

In reply to Recompose a webpage using LWP::UserAgent and HTML::Parse by Discipulus
