Hello monks,

I'm rewriting the parsing part of my WebTimeLoad because I discovered that HTML::Parse is deprecated, so I want to switch to HTML::LinkExtor. I also want to make the render option (save the page to disk and display it) more accurate.

The logic of the program is: get the page content (if a frame is found, it is pushed onto the pages queue), parse the content to grab the links and put them into a %cache, then process all the links.

The code uses this setup (simplified):

    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $url;
    my %stat;    # the cache hash where pages and links are accumulated in their keys
    my $ua     = LWP::UserAgent->new;
    my $parser = HTML::LinkExtor->new;
    my $resp   = $ua->get($url);
    $parser->parse( $resp->content );
    my $base = $resp->base;

1) grab all links

    foreach my $link_found ( $parser->links ) {
        next unless $$link_found[1] eq 'src';
        my $uriobj = URI->new( $$link_found[2] );
        my $absurl = $uriobj->abs($base);
        # if it is a frame, add it to pages, adding an iteration to this sub
        if ( $$link_found[0] eq 'frame' || $$link_found[0] eq 'iframe' ) {
            push @{ $stat{'pages'} }, "$absurl";    # ? need to stringify $absurl
            next;
        }
        # else it is content and we add it to the cache hash
        $stat{cache}{$absurl} = [];    # will store length and time there later on
    }

$parser->links returns an AoA. Is it safe to select everything where the second field (the attribute name) is 'src', or do I have to select based on the link type? Is it true that only the 'frame iframe img input layer script textarea video' tags can have a src attribute? Does it make sense to grab all of them to repaint the page?
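For reference while reading the question above: each element returned by `links` has the shape `[ $tag, $attr => $url, $attr2 => $url2, ... ]`, so `src` is not guaranteed to sit at index 1 when an element carries several link attributes. A minimal sketch of a loop that looks at every attribute pair follows. The sample HTML and the example.com base URL are made up for illustration; passing a base URL to the constructor makes the module return absolute URI objects.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample page and base URL, for illustration only.
my $base = 'http://www.example.com/dir/index.html';
my $html = <<'HTML';
<html><body>
<iframe src="frame.html"></iframe>
<img alt="logo" src="/img/logo.png">
<script src="js/app.js?ver=1"></script>
<a href="other.html">not a resource</a>
</body></html>
HTML

my %stat = ( pages => [], cache => {} );

# No callback, so links are collected; with a base they come back absolute.
my $parser = HTML::LinkExtor->new( undef, $base );
$parser->parse($html);
$parser->eof;

for my $link ( $parser->links ) {
    my ( $tag, %attr ) = @$link;     # [ tag, attr => url, attr => url, ... ]
    my $src = $attr{src} or next;    # skip elements that have no src attribute
    if ( $tag eq 'frame' or $tag eq 'iframe' ) {
        push @{ $stat{pages} }, "$src";    # stringify the URI object
    }
    else {
        $stat{cache}{"$src"} = [];         # length and time can be stored here later
    }
}

print "page:  $_\n" for @{ $stat{pages} };
print "cache: $_\n" for sort keys %{ $stat{cache} };
```

Filtering on `src` alone skips resources referenced through other attributes, such as stylesheets loaded with `<link href="...">`, so a faithful repaint may need those as well.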

2) modify the page

I want to modify the page before writing it to disk so that every src points to a local resource and all URL characters not permitted on the filesystem are translated (because some links are naughty, such as www.it.org/js/jquery/jquery.color.js?ver=2.0-4561m):

    if ($render) {
        mkdir "$ENV{TEMP}\\_temp_files" || die;
        open RENDER, "> $ENV{TEMP}/_temp.html" || die "unable to write to %TEMP%\\_temp.html";
        # localize src
        (my $localcont = $resp->content) =~ s/src="([^"]*)\//src=".\/_temp_files\//gm;
        # translate chars to be filesystem safe
        $localcont =~ s/(:?src=".\/_temp_files\/)[\?=&,;:]+(:?")/_/gm;
        print RENDER $localcont;
        close RENDER;
    }
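One thing to note about the second substitution above: `(:?` looks like a typo for `(?:`, and the replacement `_` overwrites the whole match, including the `src="./_temp_files/` prefix. A possible alternative, sketched under assumptions, is to do the rewrite in a single pass with a helper that derives the local name. The helper name `local_name` and the sample content are hypothetical, and the pattern only handles double-quoted `src` values.

```perl
use strict;
use warnings;

# Hypothetical helper: map a URL to a filesystem-safe local file name.
sub local_name {
    my ($url) = @_;
    ( my $name = $url ) =~ s{^.*/}{};    # keep what follows the last slash
    $name =~ s/[?=&,;:]/_/g;             # replace characters unsafe in file names
    return $name;
}

# Sample content, for illustration only.
my $content =
  '<script src="http://www.it.org/js/jquery/jquery.color.js?ver=2.0-4561m"></script>';

# Rewrite each double-quoted src value: keep the src=" prefix,
# then point at the sanitized name inside ./_temp_files/.
( my $localcont = $content ) =~
  s{(\bsrc\s*=\s*")([^"]*)}{ $1 . './_temp_files/' . local_name($2) }gie;

print "$localcont\n";
```

Using the same helper when the resources are written to disk keeps the names in the page and the names on the filesystem in sync.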

3) sanitize resource names in the same way to be filesystem safe

    # foreach link's $url
    if ($render) {
        (my $ele = $url) =~ s/^.*\///;
        $ele =~ s/[\?=&,;:]/_/gm;    ## same regex as above?
        open RENDER, "> $ENV{TEMP}\\_temp_files\\$ele" || die "unable to write to %TEMP%\\_temp_files\\$ele";
        binmode RENDER;
        print RENDER $resp->content;
        close RENDER;
    }

With the code shown above I get many errors (binmode on closed filehandle..) and missing elements in the page. Can someone show me a better way to do this? A working regex, or a completely different way?
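A likely contributor to the "binmode on closed filehandle" message is operator precedence: in `open RENDER, "> file" || die ...` the `||` binds to the (always true) string, so `die` never runs and a failed `open` goes unnoticed until `binmode`. The same applies to `mkdir "..." || die`. A minimal sketch using the low-precedence `or`, three-argument `open` and a lexical filehandle follows. The helper `save_resource` is hypothetical, and a throwaway directory from File::Temp stands in for %TEMP% so the example does not touch real files.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Assumption: a temporary directory stands in for %TEMP%.
my $tmp = tempdir( CLEANUP => 1 );
my $dir = File::Spec->catdir( $tmp, '_temp_files' );

# "or" has lower precedence than the call, so die fires when mkdir fails.
-d $dir or mkdir $dir or die "unable to create $dir: $!";

# Hypothetical helper: write one resource to disk in binary mode.
sub save_resource {
    my ( $dir, $name, $bytes ) = @_;
    my $path = File::Spec->catfile( $dir, $name );
    open my $fh, '>', $path or die "unable to write to $path: $!";
    binmode $fh;
    print {$fh} $bytes;
    close $fh or die "unable to close $path: $!";
    return $path;
}

my $path = save_resource( $dir, 'jquery.color.js_ver_2.0-4561m', "alert(1);\n" );
print "wrote ", ( -s $path ), " bytes to $path\n";
```

Including `$!` in the message shows why the operation failed, for example a missing directory or a character the filesystem rejects.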

Thanks in advance for your patience
L*
there are no rules, there are no thumbs..

In reply to grabbing link and 3 regexes to save HTML to disk by Discipulus
