I think you could improve clarity by either renaming request_page to something more descriptive (as it does more than request the page) or by moving the call to archive_page into the main logic so that you get a more immediate idea of script flow...

my $url     = 'http://www.perlmonks.org/';
my $content = request_page($url)           or die('...');
my $archive = archive_page($url, $content) or die('...');

I also echo snowcrash's suggestion to use File::Path and LWP::Simple. They're powerful tools, but they have some nuances to be aware of: mirror and getstore return a status code of 500 if they can't write to their target file, while mkpath propagates its errors via die (so you might want to wrap it in an eval).
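
To illustrate the eval point, here is a minimal sketch of trapping mkpath's die. The make_dir helper and the scratch directory are my own inventions for the example, not part of File::Path or of the original script, so treat it as one possible approach rather than the way to do it.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use File::Path qw/ mkpath  /;
use File::Temp qw/ tempdir /;

  # make_dir() is a hypothetical helper, not part of File::Path.
  # It returns 1 if mkpath() completed without dying, 0 otherwise.

sub make_dir {
    my $dir = shift;

    # mkpath() dies on failure, so trap that with eval.  The
    # trailing 1 gives eval a true value when nothing died.

    my $ok = eval { mkpath( [$dir], 0, 0755 ); 1 };

    warn "Couldn't create $dir: $@" unless $ok;

    return $ok ? 1 : 0;
}

  # Try it out under a scratch directory that is removed on exit.

my $base = tempdir( CLEANUP => 1 );

print "created $base/www/perlmonks/org\n"
    if make_dir("$base/www/perlmonks/org");
```

Whether you warn and carry on (as above) or re-die with a friendlier message is a matter of taste; the point is only that the failure arrives as an exception rather than a return value.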

I'd also recommend using URI which can simplify your URL parsing and ensure that you're protocol agnostic (as $url =~ s#^http://##; will break if you mirror a non-http:// URL).

Actually, rolling all three modules together, you can end up with something much simpler, although perhaps still not as readable as one would like. :-)

#!/usr/bin/perl

use File::Path  qw/ mkpath                  /;
use LWP::Simple qw/ getstore is_success $ua /;
use URI;

use strict;
use warnings;

  # ---------------------------------------------------- #

my %config  = (
  'agent'   => 'cjf/0.0.1',
  'archive' => 'archive',
);

  # ---------------------------------------------------- #

my $url     = shift || 'http://www.perlmonks.org/'; # :-)

  # Convert $url into a file path.

my $file    = join '/', 
                $config{'archive'}, 
                url_as_path($url);

  # join() always adds a '/', so an empty result from 
  # url_as_path() leaves $file as "archive/".

die "Couldn't parse $url\n" if $file eq "$config{'archive'}/";

  # Determine and create the directory (and any 
  # missing directories above) that $file will 
  # be saved to.  Errors get propagated via die().

my($path)   = $file =~ m<(.*)/[^/]+$>;
mkpath( [$path], 0, 0755 );

  # Configure our 'browser', and fetch/archive the 
  # requested URL.

$ua->agent( $config{'agent'} );

my $rc      = getstore( $url => $file );

die "Couldn't archive $url: $rc\n" unless is_success($rc);

  # ---------------------------------------------------- #

sub url_as_path {

    # Convert a URI to a Unix-style path, sans 
    # any protocol

    my $url   = URI->new(shift);

   (my $path  = $url->host) =~ tr[.][/];
       $path .= $url->path;

    # Assign default name if $url appears to be a 
    # directory instead of a file.  (A regex avoids the 
    # warnings substr() raises on an empty path.)

       $path .= 'index.html' if $url->path =~ m{/\z};

       $path;       
}

--k.

In reply to Re: Storing data in a directory hierarchy by Kanji
in thread Storing data in a directory hierarchy by cjf
