Discipulus has asked for the wisdom of the Perl Monks concerning the following question:
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $url;
    my %stat;    # the cache hash where pages and links are accumulated in their keys
    my $ua     = LWP::UserAgent->new;
    my $parser = HTML::LinkExtor->new;
    my $resp   = $ua->get($url);
    $parser->parse( $resp->content );
    my $base = $resp->base;
$parser->links returns an AoA. Is it safe to select everything where the second field (index 1) is 'src', or do I have to select based on the link type? Can only the 'frame iframe img input layer script textarea video' tags have a src associated? Does it make sense to grab all of them to repaint the page?

    foreach my $link_found ( $parser->links ) {
        next unless $$link_found[1] eq 'src';
        my $uriobj = URI->new( $$link_found[2] );
        my $absurl = $uriobj->abs($base);
        # if it is a frame, add it to pages, adding an iteration to this sub
        if ( $$link_found[0] eq 'frame' || $$link_found[0] eq 'iframe' ) {
            push @{ $stat{'pages'} }, "$absurl";    #? need to stringify $absurl
            next;
        }
        # else it is content and we add it to the cache hash
        $stat{cache}{$absurl} = [];    # will store length and time there later on
    }
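On the shape of that AoA: according to the HTML::LinkExtor documentation, each element is an array reference of the form [$tag, $attr => $url, ...], so one element can carry several attribute/URL pairs and a src is not guaranteed to be the first pair. A minimal sketch that walks every pair instead of testing only index 1; the sub name src_links and the sample data are made up for illustration and stand in for what $parser->links would return:

```perl
use strict;
use warnings;

# Collect every [tag, url] pair whose attribute name is 'src'.
# Each element of @links is assumed to have the shape
# [ $tag, $attr1 => $url1, $attr2 => $url2, ... ]
sub src_links {
    my @links = @_;
    my @found;
    for my $link (@links) {
        my ( $tag, @pairs ) = @$link;
        # consume the attribute/URL pairs two at a time
        while ( my ( $attr, $url ) = splice @pairs, 0, 2 ) {
            push @found, [ $tag, $url ] if $attr eq 'src';
        }
    }
    return @found;
}

# Hypothetical sample data, not taken from a real page
my @links = (
    [ 'img',    src         => 'logo.png', usemap => '#map' ],
    [ 'a',      href        => 'next.html' ],
    [ 'embed',  pluginspage => 'http://example.com/plugin', src => 'clip.swf' ],
    [ 'iframe', src         => 'inner.html' ],
);

print "$_->[0] => $_->[1]\n" for src_links(@links);
```

Selecting by attribute name this way keeps the frame/iframe test on the tag separate from the question of which attribute holds the URL.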
    if ($render){
        mkdir "$ENV{TEMP}\\_temp_files"||die;
        open RENDER, "> $ENV{TEMP}/_temp.html"|| die "unable to write to %TEMP%\\_temp.html";
        # localize src
        (my $localcont = $resp->content ) =~s/src="([^"]*)\//src=".\/_temp_files\//gm;
        # translate chars to be filesystem safe
        $localcont =~ s/(:?src=".\/_temp_files\/)[\?=&,;:]+(:?")/_/gm;
        print RENDER $localcont;
        close RENDER;
    }
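A note on the two substitutions above: (:?...) is a capturing group that starts with an optional colon, not the non-capturing (?:...), and the second s/// replaces its whole match (the src="./_temp_files/ prefix included) with a single underscore. One possible alternative, sketched under assumptions, is a single substitution with /e that builds the local name through a helper, so the same helper can later name the saved files. The names local_name and localize_src and the sample HTML are hypothetical. Regexes on HTML remain fragile: this only handles double-quoted src attributes, and two different URLs can map to the same file name.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a URL into a flat, filesystem-safe file name.
sub local_name {
    my ($url) = @_;
    ( my $name = $url ) =~ s{^.*/}{};    # keep what follows the last slash
    $name =~ s/[?=&,;:]/_/g;             # same character class as in the post
    return $name;
}

# Rewrite every double-quoted src attribute so that it points into $dir.
sub localize_src {
    my ( $html, $dir ) = @_;
    $html =~ s{\bsrc="([^"]*)"}{ 'src="' . $dir . '/' . local_name($1) . '"' }ge;
    return $html;
}

# Made-up page fragment for the demonstration
my $page = '<img src="http://example.com/img/logo.png?v=2">'
         . '<script src="/js/app.js"></script>';
print localize_src( $page, './_temp_files' ), "\n";
```

Because the page and the saved files would both go through local_name, the rewritten src values and the names on disk cannot drift apart the way two separately maintained regexes can.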
With the code shown above I get many errors (binmode on closed filehandle...) and missing elements in the page. Can someone show me a better way to do this: a working regex, or a completely different way?

    # foreach link's $url
    if ($render){
        (my $ele = $url )=~s/^.*\///;
        $ele =~ s/[\?=&,;:]/_/gm;    ##same regex as above?
        open RENDER, "> $ENV{TEMP}\\_temp_files\\$ele"|| die "unable to write to %TEMP%\\_temp_files\\$ele";
        binmode RENDER;
        print RENDER $resp->content;
        close RENDER;
    }
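One likely contributor to the "binmode on closed filehandle" messages (a reading of the code shown, not a confirmed diagnosis): in open RENDER, "> ..." || die "..." the || binds to the string, which is always true, so die can never run and a failed open goes unnoticed until binmode and print complain. The low-precedence or makes the check effective. A minimal sketch with a three-argument open and a lexical filehandle; the sub name save_binary, the file names and the temporary directory are assumptions for the demonstration:

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Write $content to $dir/$name in binary mode; die with the OS error on failure.
sub save_binary {
    my ( $dir, $name, $content ) = @_;
    my $path = File::Spec->catfile( $dir, $name );
    # 'or' has lower precedence than the argument list, so die sees open's result
    open my $fh, '>', $path or die "unable to write to $path: $!";
    binmode $fh;
    print {$fh} $content;
    close $fh or die "unable to close $path: $!";
    return $path;
}

# Demo in a throwaway directory (the post itself uses $ENV{TEMP}\_temp_files)
my $dir  = tempdir( CLEANUP => 1 );
my $path = save_binary( $dir, 'logo.png_v_2', "\x89PNG fake bytes" );
print "wrote ", ( -s $path ), " bytes to $path\n";

# A failed open is now reported instead of being silently ignored
my $ok = eval { save_binary( File::Spec->catdir( $dir, 'missing' ), 'x', 'data' ); 1 };
print "failed as expected: $@" unless $ok;
```

The same precedence issue applies to the mkdir ... || die line in the earlier block.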
Re: grabbing link and 3 regexes to save HTML to disk
by Athanasius (Archbishop) on Mar 22, 2013 at 13:01 UTC
by Discipulus (Canon) on Mar 22, 2013 at 20:41 UTC