venimfrogtongue has asked for the wisdom of the Perl Monks concerning the following question:

Can anyone direct me to the module(s) needed to do the following project? I need to know what I should be reading up on...

I want to set up a form where a user inputs a URL. The script will fetch and dissect the source code from that site and return it so they can view it. I need a module that lets me pull only specific tags; in fact, I only need the meta tags brought back.

Thanks!

VFT

Re: Which modules needed?
by fglock (Vicar) on Aug 08, 2002 at 13:03 UTC

    You will find almost everything you need in Bundle::LWP.

    If you are writing a command-line program, you could start by reading the source of the GET script that comes with LWP.
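
    For instance, a minimal (untested) sketch of fetching a page with LWP::Simple; the URL below is just a placeholder:

        use strict;
        use LWP::Simple;

        my $url  = 'http://www.example.com/';   # placeholder URL
        my $html = get($url);                   # get() returns undef on failure
        defined $html or die "Couldn't fetch $url\n";
        print $html;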

      Thanks fglock,

      I will definitely take a look into that!

      VFT
Re: Which modules needed?
by valdez (Monsignor) on Aug 08, 2002 at 14:04 UTC

    You can use HTML::HeadParser from the HTML::Parser distribution to extract tags inside the head section. If you need a more specific parser, there are many solutions on CPAN.
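
    For example, a rough (untested) sketch that pulls the title and a meta tag out of a fetched page could look like this; the URL is just a placeholder:

        use strict;
        use LWP::UserAgent;
        use HTTP::Request;
        use HTML::HeadParser;

        my $url  = 'http://www.example.com/';   # placeholder URL
        my $ua   = LWP::UserAgent->new;
        my $resp = $ua->request( HTTP::Request->new( GET => $url ) );
        die "Fetch failed: ", $resp->status_line, "\n" unless $resp->is_success;

        # HTML::HeadParser reads up to the end of the <head> section and
        # exposes what it finds through an HTTP::Headers-style interface;
        # <meta name="foo" ...> tags show up as 'X-Meta-Foo' headers.
        my $p = HTML::HeadParser->new;
        $p->parse( $resp->content );

        print "Title......: ", $p->header('Title')              || '', "\n";
        print "Description: ", $p->header('X-Meta-Description') || '', "\n";

    The same header() call works for any <meta name="..."> tag, which sounds like what the original question needs.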

    Ciao, Valerio

Re: Which modules needed?
by hacker (Priest) on Aug 08, 2002 at 17:12 UTC
    It sounds like you're "linting" incoming HTML. I've done this before; you'll want to use LWP::Simple, or LWP::UserAgent with HTTP::Request, to HEAD or GET the raw content, used like:
        use strict;
        use LWP::UserAgent;
        use HTTP::Request;

        my $url  = "http://www.foo.bar/blort/quux.html";
        my $req  = HTTP::Request->new( HEAD => $url );
        my $ua   = LWP::UserAgent->new;
        my $resp = $ua->request($req);

        my $type        = $resp->header('Content-Type');
        my $status_line = $resp->status_line;

    Note that I'm using LWP::UserAgent there, and also checking the result of the HEAD request to make sure the status is 200. If it's anything but a 200, you have to react accordingly (e.g. 404 is a bad URL, 500 is a server error, and so on).

    Replace HEAD with GET to pull the raw HTML page itself. Ideally you want to test HEAD on the page first, before pulling the content, but that depends on your design and on whether you are pulling lots of pages (i.e. a web spider) or one page at a time (upon user request).
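
    A rough, untested sketch of that HEAD-then-GET approach (again with a placeholder URL):

        use strict;
        use LWP::UserAgent;
        use HTTP::Request;

        my $url = "http://www.foo.bar/blort/quux.html";   # placeholder URL
        my $ua  = LWP::UserAgent->new;

        # Check the page first with a cheap HEAD request...
        my $head = $ua->request( HTTP::Request->new( HEAD => $url ) );
        die "HEAD failed: ", $head->status_line, "\n" unless $head->is_success;

        # ...then pull the raw HTML with GET.
        my $get = $ua->request( HTTP::Request->new( GET => $url ) );
        die "GET failed: ", $get->status_line, "\n" unless $get->is_success;

        my $html = $get->content;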

    You'll also likely want to use URI::Escape to make sure you are handling spaces, @ signs, and other "foreign" characters properly, so they don't get mangled by your tools or shell. Used like:

        use strict;
        use URI::Escape;

        my $url     = "http://www.foo.bar/blort/quux.html";
        my $safeurl = uri_escape($url);
        my $newurl  = uri_unescape($safeurl);

        print "URL.....: $url\n";
        print "Safe URL: $safeurl\n";
        print "New URL.: $newurl\n";

    The other modules you may want to use are HTML::LinkExtor (used to extract the links), URI::URL (to play with URI objects), and HTTP::Request (to manipulate the request object).

    I'll leave it up to you to find code examples that represent how to use those modules.

    You may want to look at Ovid's CGI Course for some more ideas.