It sounds like you're "linting" incoming HTML. I've done this before. You'll need LWP::UserAgent together with HTTP::Request (or LWP::Simple for the simplest cases) to HEAD or GET the raw content, used like:
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;

    my $url  = "http://www.foo.bar/blort/quux.html";
    my $req  = HTTP::Request->new(HEAD => $url);
    my $ua   = LWP::UserAgent->new;
    my $resp = $ua->request($req);

    my $type        = $resp->header('Content-Type');
    my $status_line = $resp->status_line;

    # Anything other than 200 needs handling of its own
    die "HEAD $url failed: $status_line\n" unless $resp->code == 200;

Note that the request goes through LWP::UserAgent, and that I check the result of the HEAD request to make sure the status is 200. If it's anything but a 200, you have to react accordingly (e.g. 404 means the page wasn't found, 403 means access is forbidden, 500 is a server-side error, and so on).

Replace HEAD with GET to pull the raw HTML page itself. Ideally you want to test HEAD on the page first, before pulling the content, but that depends on your design and on whether you are pulling lots of pages (e.g. a web spider) or one page at a time (upon user request).
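
To make the HEAD-then-GET idea concrete, here's a minimal sketch rather than a finished tool. The helper name fetch_html, the text/html check and the way failures are reported are my own assumptions for illustration; only the LWP::UserAgent and HTTP::Response calls themselves come from the modules.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper (the name is made up for this sketch): issue a HEAD
# first, and only GET the body if the server reports success and an HTML type.
# Returns (content, status) on success, or (undef, reason) otherwise.
sub fetch_html {
    my ($ua, $url) = @_;

    my $head = $ua->head($url);
    return (undef, $head->status_line) unless $head->is_success;

    # Assumption: we only care about pages served as text/html
    my $type = $head->header('Content-Type') // '';
    return (undef, "not HTML: $type") unless $type =~ m{^text/html\b}i;

    my $page = $ua->get($url);
    return (undef, $page->status_line) unless $page->is_success;

    return ($page->decoded_content, $page->status_line);
}
```

You'd call it as my ($html, $status) = fetch_html(LWP::UserAgent->new(timeout => 10), $url); and check whether $html is defined. For a spider, the extra HEAD saves you from downloading large non-HTML files. For a single page on user request, going straight to GET is probably fine.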

You'll also likely want to use URI::Escape to make sure spaces, @ signs and other "foreign" characters are encoded properly, so they don't get misparsed by your tools or shell. Used like:

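    use strict;
    use warnings;
    use URI::Escape;

    my $url     = "http://www.foo.bar/blort/quux.html";
    # uri_escape() also encodes ':' and '/', so in practice you would
    # normally escape one component (e.g. a query value) at a time
    my $safeurl = uri_escape($url);
    my $newurl  = uri_unescape($safeurl);

    print "URL.....: $url\n";
    print "Safe URL: $safeurl\n";
    print "New URL.: $newurl\n";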
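
Since uri_escape() encodes reserved characters such as ':' and '/' as well, here's a small sketch of the more common case: escaping a single query value before gluing it onto a URL. The search path and the query string are made-up values for illustration.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_unescape);

# Hypothetical base URL and query value, just for illustration
my $base  = "http://www.foo.bar/search";
my $query = "perl & lwp @ home";

# Escape only the value, so the URL's own ':' and '/' stay intact
my $url = $base . "?q=" . uri_escape($query);

print "URL.....: $url\n";
print "Original: ", uri_unescape(uri_escape($query)), "\n";
```

The space, & and @ come out as %20, %26 and %40, while the scheme and path are left alone.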

The other modules you may want to use are HTML::LinkExtor (used to extract the links), URI::URL (to play with URI objects), and HTTP::Request (to manipulate the request object).

I'll leave it up to you to find code examples that represent how to use those modules.
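
As one starting point, here's a minimal sketch of pulling links out of a page with HTML::LinkExtor. The helper name extract_links and the sample HTML are assumptions of mine, and I've used the URI module (in place of URI::URL) to resolve relative links against a base.

```perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

# Hypothetical helper: return absolute URLs for every <a href="..."> in $html
sub extract_links {
    my ($html, $base) = @_;
    my @links;

    # The callback is invoked once per link-bearing tag, with the tag name
    # followed by attribute/value pairs
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' && defined $attr{href};
        push @links, URI->new_abs($attr{href}, $base)->as_string;
    });

    $parser->parse($html);
    $parser->eof;
    return @links;
}

# Made-up sample page
my $html = '<p><a href="/blort/quux.html">quux</a> <a href="http://www.foo.bar/">home</a></p>';
print "$_\n" for extract_links($html, "http://www.foo.bar/index.html");
```

In a real run, $html would be the content you got back from the GET request, and $base would be the URL you fetched it from.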

You may want to look at Ovid's CGI Course for some more ideas.


In reply to Re: Which modules needed? by hacker
in thread Which modules needed? by venimfrogtongue
