Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I was looking to write a script that involves retrieving information from web pages. I found two ways of going about it: it seems that I could either use LWP::Simple or use a program called "curl". Does anybody have experience with either of these? Does either of them have an advantage over the other that could be useful?

Thanks.

Re: Retrieving URLs
by dkubb (Deacon) on Jan 19, 2001 at 08:45 UTC
    Here is a simple script that fetches a web page with LWP::Simple:
    #!/usr/bin/perl -w

    use strict;
    use LWP::Simple;

    my $content = get('http://www.perl.com');

    # do something with the $content
    print $content;

    From your post, I see that you are retrieving information from web pages. You didn't say what sort of information, but here are some of the more popular modules that people use to parse web page elements:

    Short Description                     HTML::* CPAN Module
    -----------------                     -------------------
    Extract table information             HTML::TableExtract
    Fetch all URL links on the page       HTML::LinkExtor
    Parse and create form attributes      HTML::Form
    Generate summary of content           HTML::Summary
    Everything else                       HTML::Parser
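
    For example, HTML::LinkExtor from the table above will pull every link out of a page. Here's a rough sketch of how it might be combined with LWP::Simple (the URL is just a placeholder, and error handling is kept to a minimum):

    #!/usr/bin/perl -w

    use strict;
    use LWP::Simple;
    use HTML::LinkExtor;

    my $url     = 'http://www.perl.com/';
    my $content = get($url);
    die "Couldn't fetch $url\n" unless defined $content;

    my @links;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            push @links, values %attr;   # href, src, etc.
        },
        $url,   # base URL, so relative links come back absolute
    );
    $parser->parse($content);
    $parser->eof;

    print "$_\n" for @links;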
Re: Retrieving URLs
by extremely (Priest) on Jan 19, 2001 at 08:19 UTC
    You know, something about this came up earlier today in Did I have to roll my own?; check that out. If you have the LWP package installed, there should be a program called "GET" that I think is better than curl as well.

    --
    $you = new YOU;
    honk() if $you->love(perl)

Re: Retrieving URLs
by cat2014 (Monk) on Jan 19, 2001 at 08:01 UTC
    I haven't used LWP, but I do know that HTML::LinkExtor works fairly well for extracting links. You can read about it on CPAN.

    It's a good way to easily get just the URLs from a page. Of course, depending on what you want to do with the URLs you get, you might be better off with LWP. Good luck! -- cat

Re: Retrieving URLs
by zeno (Friar) on Jan 19, 2001 at 14:30 UTC
    Oddly enough, I just put an entry into Craft called Use LWP::Simple to download images from a website, which shows a (very) simple way of downloading images from a web page. It could be adapted to get HTML pages as well.
    I use LWP::Simple's getstore to do this, but if you used get instead, you could store the contents of the webpage in a scalar for parsing, etc.
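    For instance, a quick sketch with getstore (the URL and filename here are just placeholders):

    use LWP::Simple;

    # getstore() writes the response body straight to a file and
    # returns the HTTP status code.
    my $status = getstore('http://www.perl.com/images/logo.gif', 'logo.gif');
    print is_success($status) ? "Saved.\n" : "Failed with status $status\n";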
    For example, you can get the contents of a web page really easily using LWP::Simple like this, from the command line:
    perl -e 'use LWP::Simple; $s = get("http://www.yahoo.com"); print $s'
    (On Windows, swap the quoting: double quotes around the -e argument and single quotes around the URL.)
    With similar code you could then parse through the HTML with regular expressions, etc. Good luck! -timallen
Re: Retrieving URLs
by Beatnik (Parson) on Jan 19, 2001 at 14:25 UTC
    You can also use a lynx --dump trick, not to mention the wget trick. Raw socket connections should also work fine, but LWP::Simple (or LWP in general) is by far the cleanest way to do it.
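
    If lynx is on the system, calling it from Perl is just a backtick away. A rough sketch (the URL is a placeholder, and this assumes lynx is in the PATH):

    my $url  = 'http://www.perl.com/';
    my $text = `lynx -dump $url`;    # rendered page as plain text
    die "lynx failed\n" if $?;
    print $text;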

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
Re: Retrieving URLs
by ColonelPanic (Friar) on Jan 19, 2001 at 23:03 UTC
    I posted a similar question recently. I ended up using an IO::Socket method by Fastolfe that I found somewhere. That's a standard module that everyone has installed, so it should be a good solution. The disadvantage is that you have to strip the headers yourself, and you also have to worry more about error handling. However, it worked readily for me.

    When's the last time you used duct tape on a duct? --Larry Wall
Re: Retrieving URLs
by Anonymous Monk on Jan 19, 2001 at 17:04 UTC
    My problem is that I'm developing a Perl script and I don't want the user to have to download any modules. I'd like it to be "self-sufficient" for the most part. Is there anything I can do to achieve this? Can someone post an example?

      LWP::Simple is SO useful, everyone should have it installed anyway =) Why are so many of us so big on using modules? Well, if you want to solve a problem, why not use a tool that's been tested again and again and again and found to work? Why bother rewriting something that's already been done WELL?

      A big issue here is how robust you want your script to be -- you can roll your own version, but it's not going to be as versatile and fault-tolerant as one that uses LWP::Simple.

      If all you want is a means of retrieving a web page, then you should be able to rely on your users having something like lynx installed (if they're on *nix-ish systems), or, heck, just tell them to download lynx =). Getting a page via lynx is as simple (as was pointed out above) as lynx --dump <url>.

      If you INSIST on doing it in Perl, then you're going to have to understand the HTTP protocol; I won't bother to do the search myself, but I seem to recall "getting a web page without LWP" being a thread on here recently. Good luck!
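
      For reference, the bare-bones approach looks roughly like this: a sketch using only IO::Socket::INET, which ships with Perl, so there's nothing extra to install. The host and path are placeholders, and there's no redirect handling or error recovery.

      #!/usr/bin/perl -w
      use strict;
      use IO::Socket::INET;   # part of the standard Perl distribution

      my ($host, $path) = ('www.perl.com', '/');

      my $sock = IO::Socket::INET->new(
          PeerAddr => $host,
          PeerPort => 80,
          Proto    => 'tcp',
      ) or die "Can't connect to $host: $!\n";

      # A minimal HTTP/1.0 request.
      print $sock "GET $path HTTP/1.0\r\n",
                  "Host: $host\r\n",
                  "\r\n";

      my $response = do { local $/; <$sock> };   # slurp the whole reply
      close $sock;

      # You have to split the headers from the body yourself.
      my ($headers, $body) = split /\r?\n\r?\n/, $response, 2;
      print $body;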

      Philosophy can be made out of anything. Or less -- Jerry A. Fodor