Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Sorry to bother you all, but I have been looking for a few weeks for a way to load other people's web pages. I need to be able to get a web page from a URL and then load it into an @array. I have failed to find any documentation on it so far; any help would be greatly appreciated.

Replies are listed 'Best First'.
RE: Loading A Site Into An Array
by vroom (His Eminence) on Mar 07, 2000 at 20:55 UTC
Re: Loading A Site Into An Array
by Anonymous Monk on Mar 10, 2000 at 23:19 UTC
    Here's an alternative socket-based approach.
    
    #!/usr/bin/perl
    
    use strict;
    use IO::Socket;
    
    my $name = "5dayForecast.pl";
    
    # Host and path of the document we want to fetch
    my $host = "www.nws.fsu.edu";
    my $path = "/wx/data/text/extended/FEUS71.KLWX";
    
    my $socket = IO::Socket::INET->new(PeerAddr => $host,
                                       PeerPort => 80,
                                       Proto    => "tcp",
                                       Type     => SOCK_STREAM,
                                       Timeout  => 10)
        or die "$name: Couldn't initialize socket: $@";
    
    # Send a minimal HTTP/1.0 request; the blank line ends the headers
    if ( !defined( $socket->send("GET $path HTTP/1.0\r\nHost: $host\r\n\r\n") ) ) {
        close $socket;
        print "$name: Can't send: $!\n";
        exit;
    }
    
    # Print the response (headers and body) line by line
    while (<$socket>) {
        print;
    }
    
    close $socket;
    
    exit;
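    Since the original question asked for the page in an @array, you can slurp the socket in list context instead of printing, e.g. (untested, but that's the idea):
    
    my @page = <$socket>;   # one line of the response per array element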
    
RE: Loading A Site Into An Array
by btrott (Parson) on Mar 08, 2000 at 00:43 UTC
Re: Loading A Site Into An Array
by Anonymous Monk on Mar 09, 2000 at 05:52 UTC
    You can also do the following (note the backticks, i.e. the backwards quotation marks!):
    
    @text = `/usr/bin/lynx -source http://www/`;
Re: Loading A Site Into An Array
by Anonymous Monk on Mar 09, 2000 at 00:59 UTC
    Coincidence? No such thing; I've been working on a similar problem (building on code a friend provided):

    #!/usr/local/bin/perl
    use LWP::Simple;
    
    # "sorturls" holds one URL (including "http://") per line
    open (IN, "sorturls") or die "Can't open sorturls: $!";
    @urls = <IN>;
    close(IN);
    
    foreach $url (@urls) {
        chomp($url);                        # strip the trailing newline
        $doc = get($url);                   # each page arrives as one string
        push (@downloaded_pages, $doc);     # one whole page per array element
    }

    Your question is a little unclear, though. Do you want the whole page to go into a single element of an array? 'Cause that's what the above will do for each page named in "sorturls" (which has a single URL, including "http://", on each line).
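    If you want one line of HTML per array element instead, you could split each page before pushing it, something like (untested):
    
    push (@downloaded_pages, split(/\n/, $doc));   # one line per element, newlines stripped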

    -T.McWee

    mcwee@m-net.arbornet.org

      To explain my situation better, I'll give some more details. I would like to make a script that functions similarly to anonymizer.com and loads a whole website into an @array. I then plan to replace all <a href="www.yahoo.com"> tags with <a href="http://www..com/route.cgi?www.yahoo.com"> so that the links go through the script first. The website that it loads could be any website on the internet. Any help would be appreciated, and the help already given is greatly appreciated as well.
        You might want to take a look at Randal Schwartz's anonymous proxy server (created for one of his WebTechniques columns).

        It doesn't try to do what you're doing (replacing links to go through a CGI script), but rather uses your browser's built-in ability to use a proxy server.

        Anyway, though, for what you asked about, take a look at HTML::LinkExtor (a subclass of HTML::Parser).

        perldoc HTML::LinkExtor
        perldoc LWP::UserAgent
        You can use LWP to fetch the web page, then extract the links, then replace each link by a modified version of itself that routes the user through your program.
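        Here's a rough, untested sketch of that approach; it assumes LWP::Simple, HTML::LinkExtor, and URI are installed, and it borrows the route.cgi name and the www.yahoo.com example from the post above. A real anonymizer would rewrite the href attributes while re-emitting the HTML (HTML::Parser can do that) rather than doing a global textual substitution, but this shows the shape of it:
        
        #!/usr/bin/perl
        use strict;
        use LWP::Simple qw(get);
        use HTML::LinkExtor;
        use URI;
        
        # Page to fetch and rewrite (example URL from the post above)
        my $base = "http://www.yahoo.com/";
        my $html = get($base) or die "Couldn't fetch $base\n";
        
        # Collect the href targets of all <a> tags on the page
        my @links;
        my $parser = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            push @links, $attr{href} if $tag eq 'a' and defined $attr{href};
        });
        $parser->parse($html);
        $parser->eof;
        
        # Rewrite each link so it routes through the CGI script first
        for my $link (@links) {
            my $abs = URI->new_abs($link, $base);     # resolve relative links
            $html =~ s/\Q$link\E/route.cgi?$abs/g;    # naive textual replacement
        }
        
        print $html;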