As a side project to sharpen my non-existent LWP skills, I started writing a small web spider. It requests an HTML page and stores the file in a directory hierarchy based on the domain (e.g. archive/www/perlmonks/org/index). It has worked for the simple test URLs I've given it so far, but every time I look at this section of code I keep thinking there has to be a much better way to do it. Am I on the right track with this, or should I be going about it in a totally different way? Any suggestions to improve the code's performance or clarity would be greatly appreciated.

#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my %config = (
    agent   => "cjf/0.0.1",   # User agent
    archive => "archive",     # data storage root directory
);

$ua->agent("$config{agent}");

request_page("http://www.perlmonks.org/");

# Fetch a page and hand its content to archive_page
sub request_page {
    my $url = shift;
    my $req = HTTP::Request->new(GET => "$url");
    $req->header('Accept' => 'text/html');
    my $res = $ua->request($req);

    if (!$res->is_success) {
        print "Error: " . $res->status_line . "\n";
        exit;
    }
    $url =~ s#^http://##;
    print "$url has been retrieved\n";
    archive_page($url, $res->content);
}

# Store the page under archive/<domain parts>/<path parts>/<filename>
sub archive_page {
    my ($url, $data) = @_;
    my ($domain, $path) = split('/', $url, 2);
    my @sections = split /\./, $domain;

    chdir "$config{archive}" or die "Can't chdir to $config{archive} $!\n";

    # One directory level per dot-separated part of the domain
    foreach my $section (@sections) {
        if (-e $section && -d $section) {
            chdir "$section" or die "Can't change directory to $section: $!\n";
        } else {
            mkdir "./$section" or die "Can't mkdir $section: $!\n";
            chdir "$section" or die "Can't change directory to $section: $!\n";
        }
    }

    if ($path) {
        # The last path component is the filename; the rest are directories,
        # kept in their original order (a reverse here would nest them inside out)
        my @directories = split('/', $path);
        my $filename = pop @directories;

        foreach my $directory (@directories) {
            if (-e $directory && -d $directory) {
                chdir "$directory" or die "Can't change directory to $directory: $!\n";
            } else {
                mkdir "./$directory" or die "Can't mkdir $directory: $!\n";
                chdir "$directory" or die "Can't change directory to $directory: $!\n";
            }
        }
        open DATA, ">$filename" or die "Can't create $filename file: $!\n";
        print DATA $data;
        close DATA;
    } else {
        open DATA, ">index" or die "Can't create data file: $!\n";
        print DATA $data;
        close DATA;
    }
    print "$url has been archived\n";
}
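
One possible way to tighten the archiving step, offered as a sketch rather than a definitive answer: build the whole target path first and let the core modules File::Path and File::Spec create it, instead of walking down with chdir and mkdir one level at a time. The names url_to_location and the extra $root parameter are hypothetical, introduced only for this illustration. Accepting https:// as well as http://, and skipping "." and ".." path components, are assumptions that go beyond the script above.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Map a URL to (directory, filename) under $root.
# Hypothetical helper: one directory per dot-separated domain part,
# then one per path component, with the last component as the filename.
sub url_to_location {
    my ($url, $root) = @_;

    (my $rest = $url) =~ s{^https?://}{};          # assumption: allow https too
    my ($domain, $path) = split m{/}, $rest, 2;
    $path = '' unless defined $path;

    # Drop empty components and "." / ".." so the path cannot leave $root
    my @parts = grep { length && $_ ne '.' && $_ ne '..' } split m{/}, $path;

    my $filename = @parts ? pop @parts : 'index';  # same default as the original
    my $dir = File::Spec->catdir($root, split(/\./, $domain), @parts);
    return ($dir, $filename);
}

# Write $data to the location derived from $url and return the file path.
sub archive_page {
    my ($url, $data, $root) = @_;
    my ($dir, $filename) = url_to_location($url, $root);

    make_path($dir);    # creates any missing levels; does nothing if present

    my $file = File::Spec->catfile($dir, $filename);
    open my $fh, '>', $file or die "Can't create $file: $!\n";
    print {$fh} $data;
    close $fh or die "Can't close $file: $!\n";
    return $file;
}
```

In the spider this could be called as archive_page($url, $res->content, $config{archive}). Because the process never changes its working directory, a second call starts from the same place as the first, which matters once the spider fetches more than one page.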

In reply to Storing data in a directory hierarchy by cjf
