cjf has asked for the wisdom of the Perl Monks concerning the following question:

As a side project to sharpen my non-existent LWP skills, I started writing a small web spider. It requests an HTML page and stores the file in a directory hierarchy based on the domain (e.g. archive/www/perlmonks/org/index). It has worked for the simple test URLs I've given it so far, but every time I look at this section of code I keep thinking there has to be a much better way to do it. Am I on the right track with this, or should I be going about it in a totally different way? Any suggestions to improve the code's performance or clarity would be greatly appreciated.

#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my %config = (
    agent   => "cjf/0.0.1",    # User agent
    archive => "archive" );    # data storage root directory

$ua->agent("$config{agent}");

request_page("http://www.perlmonks.org/");

sub request_page {
    my $url = shift;
    my $req = HTTP::Request->new(GET => "$url");
    $req->header('Accept' => 'text/html');

    my $res = $ua->request($req);
    if (!$res->is_success) {
        print "Error: " . $res->status_line . "\n";
        exit;
    }
    $url =~ s#^http://##;
    print "$url has been retrieved\n";
    archive_page($url, $res->content);
}

sub archive_page {
    my ($url, $data) = @_;
    my ($domain, $path) = split('/', $url, 2);
    my @sections = split /\./, $domain;

    chdir "$config{archive}" or die "Can't chdir to $config{archive} $!\n";
    foreach my $section (@sections) {
        if (-e $section && -d $section) {
            chdir "$section" or die "Can't change directory to $section: $!\n";
        }
        else {
            mkdir "./$section" or die "Can't mkdir $section: $!\n";
            chdir "$section" or die "Can't change directory to $section: $!\n";
        }
    }

    if ($path) {
        my ($filename, @directories) = reverse split('/', $path);
        foreach my $directory (@directories) {
            if (-e $directory && -d $directory) {
                chdir "$directory" or die "Can't change directory to $directory: $!\n";
            }
            else {
                mkdir "./$directory" or die "Can't mkdir $directory: $!\n";
                chdir "$directory" or die "Can't change directory to $directory: $!\n";
            }
        }
        open DATA, ">$filename" or die "Can't create $filename file: $!\n";
        print DATA $data;
        close DATA;
    }
    else {
        open DATA, ">index" or die "Can't create data file: $!\n";
        print DATA $data;
        close DATA;
    }
    print "$url has been archived\n";
}

Re: Storing data in a directory hierarchy
by snowcrash (Friar) on Mar 26, 2002 at 07:34 UTC
    Check out the mkpath function in File::Path to create a hierarchy of directories without having to use loops.
    Also, you could probably use LWP::Simple's getstore or mirror to save the file.
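
    A minimal sketch of the mkpath suggestion, using only core modules. The helper name archive_dir_for and the temporary root directory are assumptions made for illustration; the host is split on dots as in the original script.

```perl
use strict;
use warnings;
use File::Path qw(mkpath);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: map a host name such as www.perlmonks.org to
# <root>/www/perlmonks/org, creating any missing directories on the way.
sub archive_dir_for {
    my ($root, $host) = @_;
    my $dir = File::Spec->catdir($root, split /\./, $host);
    mkpath($dir);    # creates every missing level; dies on failure
    return $dir;
}

# Temporary root used here so the sketch does not touch a real archive.
my $root = tempdir(CLEANUP => 1);
my $dir  = archive_dir_for($root, 'www.perlmonks.org');
print "created $dir\n";
```

    Because the full path is built first, no chdir calls are needed, and the script's working directory stays where it started.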

    snowcrash
Re: Storing data in a directory hierarchy
by Kanji (Parson) on Mar 26, 2002 at 09:41 UTC

    I think you could improve clarity by either renaming request_page to something more descriptive (as it does more than request the page) or by moving the call to archive_page into the main logic so that you get a more immediate idea of script flow...

    my $url     = 'http://www.perlmonks.org/';
    my $content = request_page($url)           or die('...');
    my $archive = archive_page($url, $content) or die('...');

    I also echo snowcrash's suggestion to use File::Path and LWP::Simple, but while they're powerful tools, they do have some nuances to be aware of: namely, mirror and getstore return a status code of 500 if they can't write to their target file, while mkpath propagates its errors via die (so you might want to wrap it in an eval).
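
    The two failure modes could be handled along these lines. This is only a sketch: the helper name safe_mkpath is made up for illustration, and the getstore check appears as a comment because running it needs LWP::Simple and network access.

```perl
use strict;
use warnings;
use File::Path qw(mkpath);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: returns true on success; on failure returns false
# and leaves mkpath's error message in $@.
sub safe_mkpath {
    my ($dir) = @_;
    return eval { mkpath($dir); 1 };
}

my $root = tempdir(CLEANUP => 1);

# Normal case: nested directories are created in one call.
my $good = File::Spec->catdir($root, 'www', 'perlmonks', 'org');
my $ok   = safe_mkpath($good);

# Failure case: a plain file sits where a directory is needed, so mkpath dies.
my $blocker = File::Spec->catfile($root, 'blocker');
open my $fh, '>', $blocker or die "Can't create $blocker: $!\n";
close $fh;
my $failed = !safe_mkpath(File::Spec->catdir($blocker, 'sub'));
print "mkpath failed: $@" if $failed;

# getstore reports problems through its return value instead, e.g.:
#   my $rc = getstore($url, $file);
#   warn "Couldn't store $url: status $rc\n" unless is_success($rc);
```

    Trapping the die lets the spider report the failure and move on to the next URL instead of exiting.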

    I'd also recommend using URI which can simplify your URL parsing and ensure that you're protocol agnostic (as $url =~ s#^http://##; will break if you mirror a non-http:// URL).
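
    For illustration, here is a small sketch of how URI might replace the regex and the manual split. It assumes the URI module from CPAN is installed; the helper name url_parts and the second example URL are assumptions.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: break a URL into the pieces archive_page needs,
# without caring which scheme it uses.
sub url_parts {
    my ($url) = @_;
    my $uri   = URI->new($url);
    my @host  = split /\./, $uri->host;                 # e.g. www, perlmonks, org
    my @path  = grep { length } split m{/}, $uri->path; # drop empty segments
    my $file  = @path ? pop @path : 'index';            # same default as the original script
    return ($uri->scheme, [@host, @path], $file);
}

for my $url ('http://www.perlmonks.org/', 'https://www.perlmonks.org/a/b/page.html') {
    my ($scheme, $dirs, $file) = url_parts($url);
    print "$scheme: ", join('/', @$dirs), " -> $file\n";
}
```

    The directory list comes back in top-down order, so it can be handed to File::Spec->catdir and mkpath as-is.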

    Actually, rolling all three modules together, you can end up with something much simpler; although perhaps still not as readable as one would like. :-)