Hot Pastrami has asked for the wisdom of the Perl Monks concerning the following question:

Hello, everyone! It's been too long.

I have a quandary... I wish to determine how to grab a webpage through HTTP without using any libraries outside the standard Perl distribution. What's more, I need it to work on any Win32 system, so Lynx is RIGHT OUT.

My aim is to create a single, fully self-contained .pl file which can grab a webpage through HTTP, and then just vomit out the HTML code, nothing spectacular. Workability through a proxy server, while a delightful prospect, is not required. If necessary I can absorb code from modules which have already accomplished as much, but methinks that I have seen, in some foregone posting, a simple routine for doing this thing I describe... but any search provides too many impedimentary matches to be useful. How, praytell, would you kind folk go about coming to the ends I seek?

Many thanks,

Alan "Hot Pastrami" Bellows
-Sitting calmly with scissors-

Replies are listed 'Best First'.
Re: Grabbing a web page without LWP or the like
by Fastolfe (Vicar) on Nov 21, 2000 at 23:49 UTC
    Why not:
    use IO::Socket; # distributed with Perl
    my $web = IO::Socket::INET->new("www.example.com:80")
        or die "Couldn't connect: $@";
    print $web "GET /some/file HTTP/1.0\r\n";            # HTTP lines end in CRLF
    print $web "Host: www.example.com\r\n\r\n";
    undef $/;                # slurp mode: read the whole response, headers and body
    my $results = <$web>;
    Now just parse out the headers, look for errors, etc. This will not follow redirects (e.g. "/some/directory" -> "/some/directory/"), and is generally only usable for the most basic case of web requests. If you want any real abilities outside of this, you'd be far better off using LWP, or at least reading through it and pulling out the code that you need.
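One way the "parse out the headers, look for errors" step might look, as a minimal sketch rather than a complete HTTP parser. The helper name `parse_response` and the sample response text are made up for illustration. It assumes the whole response is already in one string and ignores folded (continuation) header lines.

```perl
use strict;
use warnings;

# Split a raw HTTP response into status code, headers, and body.
# parse_response is a hypothetical helper name, not part of any module.
sub parse_response {
    my ($raw) = @_;
    return unless defined $raw and length $raw;

    # Headers end at the first blank line; tolerate bare LF as well as CRLF.
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    return unless defined $head and length $head;
    $body = '' unless defined $body;

    my ($status_line, @lines) = split /\r?\n/, $head;
    my ($code) = $status_line =~ m{^HTTP/\d+\.\d+\s+(\d{3})}
        or return;    # not an HTTP response

    my %header;
    for my $line (@lines) {
        my ($name, $value) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
        $header{ lc $name } = $value;    # header names are case-insensitive
    }
    return ($code, \%header, $body);
}

# Made-up sample response showing the redirect case mentioned above.
my $raw = "HTTP/1.0 301 Moved Permanently\r\n"
        . "Location: http://www.example.com/some/directory/\r\n"
        . "Content-Type: text/html\r\n"
        . "\r\n"
        . "<html>moved</html>\n";

my ($code, $header, $body) = parse_response($raw);
if ($code =~ /^3/ and $header->{location}) {
    print "redirected to $header->{location}\n";
}
elsif ($code !~ /^2/) {
    print "request failed with status $code\n";
}
else {
    print $body;
}
```

Following the redirect would mean making a second request to the `Location` value. Once that is added you are rebuilding a fair part of LWP, which is the trade-off described above.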
      Oh, that's beautiful... thank you. This is a good indication that I really need to get to know IO::Socket better.

      Alan "Hot Pastrami" Bellows
      -Sitting calmly with scissors-
Re (tilly) 1: Grabbing a web page without LWP or the like
by tilly (Archbishop) on Nov 22, 2000 at 00:06 UTC
    Win32?

    Try Win32::Internet out then. That will allow you to get http, ftp, https, etc while using the correct proxy servers. If you do it right you can even have it cascade through possible modules, so it is portable to Unix as well.
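A rough sketch of the "cascade through possible modules" idea, assuming the preference order Win32::Internet, then LWP::Simple. The names `pick_backend` and `@backends` are invented for this example. The `FetchURL` and `get` calls follow those modules' documented interfaces, but the fetch routines are not exercised here, so treat them as untested.

```perl
use strict;
use warnings;

# Candidate back ends in order of preference. Each entry pairs a module
# name with a code ref that fetches a URL once that module is loaded.
my @backends = (
    [ 'Win32::Internet' => sub { Win32::Internet->new->FetchURL($_[0]) } ],
    [ 'LWP::Simple'     => sub { LWP::Simple::get($_[0]) } ],
);

# Return the name and fetch routine of the first back end that loads,
# or an empty list if none of them is installed.
sub pick_backend {
    for my $entry (@_) {
        my ($module, $fetch) = @$entry;
        return ($module, $fetch) if eval "require $module; 1";
    }
    return;
}

my ($module, $fetch) = pick_backend(@backends);
print defined $module
    ? "using $module\n"
    : "no optional back end found; fall back to plain IO::Socket\n";
```

Because each module is loaded with `require` inside a string `eval`, a missing module is skipped instead of killing the script, so the same file can run on Win32 and Unix.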

Re: Grabbing a web page without LWP or the like
by japhy (Canon) on Nov 22, 2000 at 00:08 UTC
    I wrote a module that handles this kinda well. I'm thinking of adding redirection support and all, but I'm dangerously close to reinventing the wheel.
    use LWP::FileHandle;
    lwpopen HOMEPAGE, GET => "http://www.pobox.com/~japhy/"
        or die "can't access the url: $!";
    while (<HOMEPAGE>) {
        print if m!<ul>! .. m!</ul>!;
    }
    lwpclose HOMEPAGE;
    Get the module at http://www.pobox.com/~japhy/modules/LWP-FileHandle-0.01.tar.gz. Sorry, no documentation in it (yet), but it's self-explanatory, and comes with a test program.

    japhy -- Perl and Regex Hacker
      pssst....bud! The question said something about "without LWP or the like"

      Update:
      FWIW, I did read the code before posting this. I interpreted use URI::Escape as "the like."

      I apologize if this sounded snippy or mean -- it was meant in humor.

        Psst -- the module doesn't use the LWP suite. If you checked the source to the module, you'd see it just uses the standard IO::Socket module. I put it in the LWP namespace because it's similar in function.

        Tsk, tsk. So quick to prejudge...

        japhy -- Perl and Regex Hacker
Re: Grabbing a web page without LWP or the like
by dws (Chancellor) on Nov 22, 2000 at 01:53 UTC
    Here's one I hacked up a while back when faced with a similar (distribution-only) constraint. I had the particular problem of often needing to see only the response header. Hence the -h option. It isn't foolproof, but it gets the job done.
    #!c:/perl/bin/perl.exe
    # get.pl -- Make an HTTP GET request and report the results
    #
    # Dave Smith, 6/15/00
    
    use strict;
    use IO::Socket;
    
    my $get_or_head = "GET";
    
    my $headeronly = 0;
    if ( $ARGV[0] eq "-h" ) {
        $headeronly = 1;
        $get_or_head = "HEAD";
        shift;
    }
    
    my $url = shift or usage();
    my ($host,$uri) = $url =~ m#^(?:http://|//|)([^/]*)/?(.*)$#;
    # print "host=$host uri=$uri\n";
    usage() if not $host;
    
    
    my $sock = IO::Socket::INET->new(PeerAddr => $host,
                                     PeerPort => 'http(80)',
                                     Proto    => 'tcp');
    die "Couldn't open socket to $host" if not $sock;
    print $sock "$get_or_head /$uri HTTP/1.0\r\n",
                "Accept: text/plain, text/html, text/xml, image/gif\r\n",
                # "If-modified-since: Sat, 14 Jul 2000 01:51:07 GMT\r\n",
                "Host: $host\r\n",    # name the host actually being requested
                "\r\n";
    
    while ( <$sock> ) {
        s/\r//;
        last if $headeronly and /^$/;
        print;
    }
    
    
    sub usage {
    print <<"END";
    usage:
        $0 [-h] fully-qualified-URL
    
            -h response header only
    END
    
        exit(0);
    }
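For anyone puzzling over the URL-splitting pattern in get.pl, here is a small sketch that applies the same regex to a few sample inputs. The wrapper name `split_url` and the example URLs are invented for illustration.

```perl
use strict;
use warnings;

# Split a URL into host and path with the same pattern get.pl uses.
# split_url is a hypothetical wrapper added for illustration.
sub split_url {
    my ($url) = @_;
    my ($host, $uri) = $url =~ m#^(?:http://|//|)([^/]*)/?(.*)$#;
    return ($host, $uri);
}

for my $url ('http://www.example.com/some/file',
             '//www.example.com/',
             'www.example.com') {
    my ($host, $uri) = split_url($url);
    print "host=$host uri=/$uri\n";
}
```

The first URL prints `host=www.example.com uri=/some/file`, and the other two print `host=www.example.com uri=/`. One limitation worth knowing: the pattern only recognises an `http://` or `//` prefix, so an `https://` URL comes back with `https:` as the host.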
    
    
      I appreciate all the help, this is useful stuff. However, I have a non-Perl question as relates to this thread... how has the Reputation on the originating thread wandered into the negative (-3 as of now)? I see no unpleasantness in it... have I adopted some irritating ways and then become blind to them? In what way is it deserving of disrepute? Let me know, so that I may not make the mistake again.

      Alan "Hot Pastrami" Bellows

        Some people are probably tired of hearing "I need to drill a hole but I can't be bothered to install a (free) high-quality commercial drill but rather must install something I build myself which won't be as good at drilling."

        I do understand several of the problems with installing modules that lead to the very often repeated requests for how to do things that great modules exist for but without using these great modules. But it doesn't mean that the requests don't get tiring.

        The source code for the modules is freely available so if there is some magic about installing the code that you write, then you can use the module source code in order to rewrite the module yourself. But most of us suggest that you figure out how to install some good quality modules along with whatever code you end up writing and installing.

                - tye (but my friends call me "Tye")
        Ah... a helpful monk pointed out to me that the "--" votes are probably due to the fact that I am trying to reinvent the wheel with this solution, and such is frowned upon. Well, I have 2 things to say in my defence regarding that:
        1. It is clear that the existing wheel does not fit the vehicle, so I am forced to improvise, and
        2. If no one EVER reinvented the wheel, we'd still be clunking around on some rounded rock with a stick through the middle. Re-exercising an old skill a bit never hurt anybody, and often hones said skill.
        Now I'm wondering how many "--" votes this one will fetch. Ah, well.

        Alan "Hot Pastrami" Bellows
        -Sitting calmly with scissors-
Re: Grabbing a web page without LWP or the like
by arturo (Vicar) on Nov 21, 2000 at 23:54 UTC

    My non-perl answer of the day: lynx runs on Win32

    Philosophy can be made out of anything. Or less -- Jerry A. Fodor

      Really? Interesting. However, Lynx still cannot be utilized because it would defeat the .pl file's self-contained requirement. But, thank you all the same.

      Alan "Hot Pastrami" Bellows
      -Sitting calmly with scissors-