costas has asked for the wisdom of the Perl Monks concerning the following question:

Before I embark on this job I thought I would ask you guys what you think the best way is of...

Grabbing a hundred or so HTML pages from the web (the page names conveniently increment: ...1.htm, ...2.htm) and putting them into variables to be regexed.

I will also want to perform this routine on a weekly basis.

Costas

Replies are listed 'Best First'.
Re: Grabbing a hundred pages
by Beatnik (Parson) on Jun 22, 2001 at 13:23 UTC
    Well, don't load 'em all at once... unless you have a hardware sponsor or your boss doesn't mind you writing bad code :)
    LWP::Simple is your friend... HTML::Parser probably too.
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    for my $i (1..100) {
        my $page = get("http://www.some.url.com/$i.html");
        next unless defined $page;   # get() returns undef on failure
        foo($page);                  # process each page as it arrives
    }

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
Re: Grabbing a hundred pages
by andye (Curate) on Jun 22, 2001 at 13:28 UTC
    You probably want to use the module LWP::Simple, like this...
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my $page = get('http://www.example.com/');
    print $page;
    To get a hundred numbered pages, you could do something like this:
    my @ary;
    foreach (1..100) {
        push @ary, get("http://www.example.com/$_.html");
    }

    andy.

      Or:
      my @pages = map {get("http://www.example.com/$_.html")} 1..100;
      
Re: Grabbing a hundred pages
by voyager (Friar) on Jun 22, 2001 at 16:40 UTC
    LWP::Simple should be your first choice, as noted above. However, it is somewhat limiting, and as you are getting 100+ pages you may appreciate the ability to set timeouts, etc.

    So, only a little more complicated than LWP::Simple:

    use LWP::UserAgent;
    use HTTP::Request::Common qw(GET);

    my $ua = LWP::UserAgent->new;
    $ua->timeout(30);   # one of the knobs LWP::Simple doesn't give you

    my $response = $ua->request(GET $url);   # $url as in the loops above
    my $html     = $response->content;
    And then parse the HTML as described above.

    If you do wind up using LWP::Simple, check out the

    if (is_success(mirror($url, $local_file))) { ...

    construct, so you are only processing HTML docs that have changed since your last run (mirror() sends an If-Modified-Since request, and is_success() is false for the 304 Not Modified reply it gets back when nothing has changed).
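    Putting that together, a minimal sketch (the URL scheme, local directory, and process() handler are all placeholders to adapt):

    use LWP::Simple;

    my $dir = 'pages';
    mkdir $dir unless -d $dir;

    for my $i (1..100) {
        my $url  = "http://www.example.com/$i.html";
        my $file = "$dir/$i.html";
        # mirror() only fetches when the remote copy is newer than
        # the local one, so on weekly reruns you skip unchanged pages.
        if (is_success(mirror($url, $file))) {
            process($file);   # hypothetical per-page handler
        }
    }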
Re: Grabbing a hundred pages
by premchai21 (Curate) on Jun 22, 2001 at 20:54 UTC
    You should, as others have suggested, use LWP::Simple. Use cron (*ix) or Task Scheduler (Win32) or whatever the equivalent of that is on Mac to run your script weekly.
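    For example, a crontab entry along these lines (the script path and schedule here are just placeholders) would run it every Monday at 6am:

    # min hour dom mon dow  command
    0 6 * * 1 perl /home/costas/grab_pages.pl   # placeholder path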

    However, you might not want to put them into variables. Why do you want to run regexen on these pages in the first place? If you want to store them locally and/or put them back, you could set up a directory accessible only by your script, and fetch all the pages into that directory using LWP, then read each one (line by line or as a whole). If you need to run regexen which modify the string, try in-place editing (documented under the -i switch in perlrun).
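    A minimal sketch of that in-place approach, assuming the pages were fetched into a pages/ directory (the substitution itself is a placeholder):

    #!/usr/bin/perl -w
    use strict;

    # Setting $^I and @ARGV makes the <> loop edit files in place,
    # just as the -i switch does (see perlrun). '.bak' keeps backups.
    $^I   = '.bak';
    @ARGV = glob('pages/*.html');

    while (<>) {
        s/old text/new text/g;   # your regex here
        print;                   # print() writes back into the file
    }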

    Also, if you're planning on parsing the HTML using regexen, that's generally a bad idea; in that case, look into HTML::Parser instead.
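    For instance, a minimal HTML::Parser sketch that pulls the <title> text out of a page (the handler wiring here is illustrative; adapt it to whatever you're actually extracting):

    use HTML::Parser;

    my $in_title = 0;
    my $title    = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub { $in_title = 1 if $_[0] eq 'title' }, 'tagname' ],
        end_h   => [ sub { $in_title = 0 if $_[0] eq 'title' }, 'tagname' ],
        text_h  => [ sub { $title .= $_[0] if $in_title },      'text'    ],
    );

    $p->parse($page);   # $page holds the HTML, as in the snippets above
    $p->eof;
    print "Title: $title\n";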