venimfrogtongue has asked for the wisdom of the Perl Monks concerning the following question:

Can anyone direct me to the module(s) needed to do the following project? I need to know what I should be reading up on...

I want to set up a form where a user inputs a URL. The script will fetch and dissect the source code from that site and return it so they can view it. I need a module that lets me pull only specific tags; in fact, I only need the meta tags brought back.

Thanks!

VFT

Re: Which modules needed?
by fglock (Vicar) on Aug 08, 2002 at 13:03 UTC

    You will find almost everything you need in Bundle::LWP.

    If you are writing a command-line program, you could start by reading the source of the GET script that comes with LWP.
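
    For instance, a minimal (untested) sketch of fetching a page with LWP::Simple; the URL below is just a placeholder:

        use strict;
        use LWP::Simple;

        my $url  = 'http://www.example.com/';   # placeholder URL
        my $html = get($url);                   # get() returns undef on failure
        defined $html or die "Couldn't fetch $url\n";
        print $html;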

      Thanks fglock,

      I will definitely take a look into that!

      VFT
Re: Which modules needed?
by valdez (Monsignor) on Aug 08, 2002 at 14:04 UTC

    You can use HTML::HeadParser from the HTML::Parser distribution to extract tags inside the head section. If you need a more specific parser, there are many solutions on CPAN.
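
    For example, a rough (untested) sketch that pulls the title and a meta tag out of a fetched page could look like this; the URL is just a placeholder:

        use strict;
        use LWP::UserAgent;
        use HTTP::Request;
        use HTML::HeadParser;

        my $url  = 'http://www.example.com/';   # placeholder URL
        my $ua   = LWP::UserAgent->new;
        my $resp = $ua->request( HTTP::Request->new( GET => $url ) );
        die "Fetch failed: ", $resp->status_line, "\n" unless $resp->is_success;

        # HTML::HeadParser reads up to the end of the <head> section and
        # exposes what it finds through an HTTP::Headers-style interface;
        # <meta name="foo" ...> tags show up as 'X-Meta-Foo' headers.
        my $p = HTML::HeadParser->new;
        $p->parse( $resp->content );

        print "Title......: ", $p->header('Title')              || '', "\n";
        print "Description: ", $p->header('X-Meta-Description') || '', "\n";

    The same header() call works for any <meta name="..."> tag, which sounds like what the original question needs.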

    Ciao, Valerio

Re: Which modules needed?
by hacker (Priest) on Aug 08, 2002 at 17:12 UTC
    It sounds like you're "linting" incoming HTML. I've done this before; you'll want to use LWP::Simple, or LWP::UserAgent with HTTP::Request, to HEAD or GET the raw content, used like:
        use strict;
        use LWP::UserAgent;
        use HTTP::Request;

        my $url  = "http://www.foo.bar/blort/quux.html";
        my $req  = HTTP::Request->new( HEAD => $url );
        my $ua   = LWP::UserAgent->new;
        my $resp = $ua->request($req);

        my $type        = $resp->header('Content-Type');
        my $status_line = $resp->status_line;

    Note that I'm using LWP::UserAgent there, and also checking the result of the HEAD request to make sure the status is 200. If it's anything but a 200, you have to react accordingly (e.g. 404 is a bad URL, 500 is a server error, and so on).

    Replace HEAD with GET to pull the raw HTML page itself. Ideally you want to test HEAD on the page first, before pulling the content, but that depends on your design and on whether you are pulling lots of pages (i.e. a web spider) or one page at a time (upon user request).
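
    A rough, untested sketch of that HEAD-then-GET approach (again with a placeholder URL):

        use strict;
        use LWP::UserAgent;
        use HTTP::Request;

        my $url = "http://www.foo.bar/blort/quux.html";   # placeholder URL
        my $ua  = LWP::UserAgent->new;

        # Check the page first with a cheap HEAD request...
        my $head = $ua->request( HTTP::Request->new( HEAD => $url ) );
        die "HEAD failed: ", $head->status_line, "\n" unless $head->is_success;

        # ...then pull the raw HTML with GET.
        my $get = $ua->request( HTTP::Request->new( GET => $url ) );
        die "GET failed: ", $get->status_line, "\n" unless $get->is_success;

        my $html = $get->content;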

    You'll also likely want to use URI::Escape to make sure you are handling spaces, @ signs, and other "foreign" characters properly, so they don't get mangled by your tools or shell. Used like:

        use strict;
        use URI::Escape;

        my $url     = "http://www.foo.bar/blort/quux.html";
        my $safeurl = uri_escape($url);
        my $newurl  = uri_unescape($safeurl);

        print "URL.....: $url\n";
        print "Safe URL: $safeurl\n";
        print "New URL.: $newurl\n";

    The other modules you may want to use are HTML::LinkExtor (used to extract the links), URI::URL (to play with URI objects), and HTTP::Request (to manipulate the request object).

    I'll leave it up to you to find code examples that represent how to use those modules.

    You may want to look at Ovid's CGI Course for some more ideas.