costas has asked for the wisdom of the Perl Monks concerning the following question:

Before I embark on this job I thought I would ask you guys what you think the best way is of...

Grabbing a hundred or so HTML pages from the web (the page names conveniently increment: ...1.htm, ...2.htm) and putting them into variables to be regexed.

I will also want to perform this routine on a weekly basis.

Costas

Replies are listed 'Best First'.
Re: Grabbing a hundred pages
by Beatnik (Parson) on Jun 22, 2001 at 13:23 UTC
    Well, don't load 'em all at once... unless you have a hardware sponsor or your boss doesn't mind you writing bad code :)
    LWP::Simple is your friend... HTML::Parser probably too.
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    for my $i (1..100) {
        my $page = get("http://www.some.url.com/$i.html");
        next unless defined $page;   # get() returns undef on failure
        foo($page);                  # process each page as it arrives
    }

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
Re: Grabbing a hundred pages
by andye (Curate) on Jun 22, 2001 at 13:28 UTC
    You probably want to use the module LWP::Simple, like this...
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my $page = get('http://www.example.com/');
    print $page;
    To get a hundred numbered pages, you could do something like this:
    my @ary;
    foreach (1..100) {
        push @ary, get("http://www.example.com/$_.html");
    }

    andy.

      Or:
      my @pages = map {get("http://www.example.com/$_.html")} 1..100;
      
Re: Grabbing a hundred pages
by voyager (Friar) on Jun 22, 2001 at 16:40 UTC
    LWP::Simple should be your first choice, as noted above. However, it is somewhat limiting, and as you are getting 100+ pages you may appreciate the ability to set timeouts, etc.

    So, only a little more complicated than LWP::Simple:

    use LWP::UserAgent;
    use HTTP::Request::Common qw(GET);

    my $ua = LWP::UserAgent->new;
    $ua->timeout(30);   # one of the knobs LWP::Simple doesn't give you

    my $response = $ua->request(GET $url);   # $url as in the loops above
    my $html     = $response->content;
    And then parse the HTML as described above.

    If you do wind up using LWP::Simple, check out the

    if (is_success(mirror($url, $local_file))) { ...

    construct, so you are only processing HTML docs that have changed since your last run (mirror() sends an If-Modified-Since request, and is_success() is false for the 304 Not Modified reply it gets back when nothing has changed).
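    Putting that together, a minimal sketch (the URL scheme, local directory, and process() handler are all placeholders to adapt):

    use LWP::Simple;

    my $dir = 'pages';
    mkdir $dir unless -d $dir;

    for my $i (1..100) {
        my $url  = "http://www.example.com/$i.html";
        my $file = "$dir/$i.html";
        # mirror() only fetches when the remote copy is newer than
        # the local one, so on weekly reruns you skip unchanged pages.
        if (is_success(mirror($url, $file))) {
            process($file);   # hypothetical per-page handler
        }
    }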
Re: Grabbing a hundred pages
by premchai21 (Curate) on Jun 22, 2001 at 20:54 UTC
    You should, as others have suggested, use LWP::Simple. Use cron (*ix) or Task Scheduler (Win32) or whatever the equivalent of that is on Mac to run your script weekly.
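    For example, a crontab entry along these lines (the script path and schedule here are just placeholders) would run it every Monday at 6am:

    # min hour dom mon dow  command
    0 6 * * 1 perl /home/costas/grab_pages.pl   # placeholder path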

    However, you might not want to put them into variables. Why do you want to run regexen on these pages in the first place? If you want to store them locally and/or put them back, you could set up a directory accessible only by your script, and fetch all the pages into that directory using LWP, then read each one (line by line or as a whole). If you need to run regexen which modify the string, try in-place editing (documented under the -i switch in perlrun).
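    A minimal sketch of that in-place approach, assuming the pages were fetched into a pages/ directory (the substitution itself is a placeholder):

    #!/usr/bin/perl -w
    use strict;

    # Setting $^I and @ARGV makes the <> loop edit files in place,
    # just as the -i switch does (see perlrun). '.bak' keeps backups.
    $^I   = '.bak';
    @ARGV = glob('pages/*.html');

    while (<>) {
        s/old text/new text/g;   # your regex here
        print;                   # print() writes back into the file
    }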

    Also, if you're planning on parsing the HTML using regexen, that's generally a bad idea; in that case, look into HTML::Parser instead.
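    For instance, a minimal HTML::Parser sketch that pulls the <title> text out of a page (the handler wiring here is illustrative; adapt it to whatever you're actually extracting):

    use HTML::Parser;

    my $in_title = 0;
    my $title    = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub { $in_title = 1 if $_[0] eq 'title' }, 'tagname' ],
        end_h   => [ sub { $in_title = 0 if $_[0] eq 'title' }, 'tagname' ],
        text_h  => [ sub { $title .= $_[0] if $in_title },      'text'    ],
    );

    $p->parse($page);   # $page holds the HTML, as in the snippets above
    $p->eof;
    print "Title: $title\n";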