Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Sorry to bother you all, but I have been looking for a few weeks for a way to load other people's web pages. I need to be able to get a web page from a URL and then load it into an @array. I have failed to find any documentation on it so far; any help would be greatly appreciated.

Replies are listed 'Best First'.
RE: Loading A Site Into An Array
by vroom (His Eminence) on Mar 07, 2000 at 20:55 UTC
Re: Loading A Site Into An Array
by Anonymous Monk on Mar 10, 2000 at 23:19 UTC
    Here's an alternative socket-based approach.
    
    #!/usr/bin/perl
    
    use strict;
    use IO::Socket;
    
    my $name = "5dayForecast.pl";
    
    # Host and path of the document we want to fetch
    my $host = "www.nws.fsu.edu";
    my $path = "/wx/data/text/extended/FEUS71.KLWX";
    
    my $socket = IO::Socket::INET->new(PeerAddr => $host,
                                       PeerPort => 80,
                                       Proto    => "tcp",
                                       Type     => SOCK_STREAM,
                                       Timeout  => 10)
        or die "$name: Couldn't initialize socket: $@";
    
    # Send a minimal HTTP/1.0 request; the blank line ends the headers
    if ( !defined( $socket->send("GET $path HTTP/1.0\r\nHost: $host\r\n\r\n") ) ) {
        close $socket;
        print "$name: Can't send: $!\n";
        exit;
    }
    
    # Print the response (headers and body) line by line
    while (<$socket>) {
        print;
    }
    
    close $socket;
    
    exit;
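    Since the original question asked for the page in an @array, you can slurp the socket in list context instead of printing, e.g. (untested, but that's the idea):
    
    my @page = <$socket>;   # one line of the response per array element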
    
RE: Loading A Site Into An Array
by btrott (Parson) on Mar 08, 2000 at 00:43 UTC
Re: Loading A Site Into An Array
by Anonymous Monk on Mar 09, 2000 at 05:52 UTC
    You can also do the following (note the backticks, i.e. the backwards quotation marks!):
    
    @text = `/usr/bin/lynx -source http://www/`;
Re: Loading A Site Into An Array
by Anonymous Monk on Mar 09, 2000 at 00:59 UTC
    Coincidence? No such thing; I've been working on a similar problem (building on code a friend provided):

    #!/usr/local/bin/perl
    use LWP::Simple;
    
    # "sorturls" holds one URL (including "http://") per line
    open (IN, "sorturls") or die "Can't open sorturls: $!";
    @urls = <IN>;
    close(IN);
    
    foreach $url (@urls) {
        chomp($url);                        # strip the trailing newline
        $doc = get($url);                   # each page arrives as one string
        push (@downloaded_pages, $doc);     # one whole page per array element
    }

    Your question is a little unclear, though. Do you want the whole page to go into a single element of an array? 'Cause that's what the above will do for each page named in "sorturls" (which has a single URL, including "http://", on each line).
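    If you want one line of HTML per array element instead, you could split each page before pushing it, something like (untested):
    
    push (@downloaded_pages, split(/\n/, $doc));   # one line per element, newlines stripped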

    -T.McWee

    mcwee@m-net.arbornet.org

      To explain my situation better, I'll give some more details. I would like to make a script that functions similarly to anonymizer.com and loads a whole website into an @array. I then plan to replace all <a href="www.yahoo.com"> tags with <a href="http://www..com/route.cgi?www.yahoo.com"> so that the links go through the script first. The website that it loads could be any website on the internet. Any help would be appreciated, and the help already given is greatly appreciated as well.
        You might want to take a look at Randal Schwartz's anonymous proxy server (created for one of his WebTechniques columns).

        It doesn't try to do what you're doing (replacing links to go through a CGI script), but rather uses your browser's built-in ability to use a proxy server.

        Anyway, though, for what you asked about, take a look at HTML::LinkExtor (a subclass of HTML::Parser).

        perldoc HTML::LinkExtor
        perldoc LWP::UserAgent
        You can use LWP to fetch the web page, then extract the links, then replace each link by a modified version of itself that routes the user through your program.
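        Here's a rough, untested sketch of that approach; it assumes LWP::Simple, HTML::LinkExtor, and URI are installed, and it borrows the route.cgi name and the www.yahoo.com example from the post above. A real anonymizer would rewrite the href attributes while re-emitting the HTML (HTML::Parser can do that) rather than doing a global textual substitution, but this shows the shape of it:
        
        #!/usr/bin/perl
        use strict;
        use LWP::Simple qw(get);
        use HTML::LinkExtor;
        use URI;
        
        # Page to fetch and rewrite (example URL from the post above)
        my $base = "http://www.yahoo.com/";
        my $html = get($base) or die "Couldn't fetch $base\n";
        
        # Collect the href targets of all <a> tags on the page
        my @links;
        my $parser = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            push @links, $attr{href} if $tag eq 'a' and defined $attr{href};
        });
        $parser->parse($html);
        $parser->eof;
        
        # Rewrite each link so it routes through the CGI script first
        for my $link (@links) {
            my $abs = URI->new_abs($link, $base);     # resolve relative links
            $html =~ s/\Q$link\E/route.cgi?$abs/g;    # naive textual replacement
        }
        
        print $html;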