Ionizor has asked for the wisdom of the Perl Monks concerning the following question:

I've written up this little script that will print a range of sequential URLs based on an input URL. The idea is to generate a list of files suitable for use with wget. I was just wondering if there were any obvious improvements I could make to the regexes I'm using to extract the pathnames (or a module to eliminate them entirely) or just general improvements that could be made to the script.

Any help is much appreciated. Thanks.

#!/usr/bin/perl
require 5.6.1;
use strict;
use warnings;

die "Usage: listseq.pl <url> [min] <max>\n"
    unless (@ARGV >= 2 and @ARGV <= 3);

# Retrieve the URL
my $extract = shift;

# Sort out the range of numbered files we should be getting
my $min;
if (@ARGV >= 2) { $min = shift; } else { $min = 1; }
my $max = shift;

# Separate the path from the filename and save both
$extract =~ /^(\S*\/)(\S*?)$/;
my ($filepath, $filename) = ($1, $2);

# Pull the filename and extension from the file; determine precision (1 vs 01)
$filename =~ /^(\S+?)(\d+)(\.\S+)$/;
my ($name, $numlength, $extension) = ($1, length($2), $3);

# Print the list of filenames
for (my $i = $min; $i <= $max; $i++) {
    print ($filepath, $name, (sprintf "%0${numlength}d", $i), $extension, "\n");
}

--
Grant me the wisdom to shut my mouth when I don't know what I'm talking about.

Replies are listed 'Best First'.
Re: Downloading a range of sequential files
by revdiablo (Prior) on Jul 25, 2003 at 03:03 UTC

    I have a script exactly like this. It was one of my Very First Perl Scripts Ever (tm). I called it... wait for it... urlrange. :) One thing you might want to add is a way to specify the precision of numbers in the range. Your automagic precision detection will work in most cases, but occasionally it's nice to be able to specify.

    Also, you might want to make a way to specify a suffix, rather than one being automatically determined. My script takes its args as prefix first last suffix. So if I had http://blah.com/file01-blah.txt ..., I would run: urlrange http://www.blah.com/file0 1 5 -blah.txt.

    Hopefully you will find this useful.

    Update: you can look at my version at my website.

      Specifying precision - good idea. Thanks!

      As far as suffix being automatically determined, what I had didn't actually work very well since any file that didn't end in a number would cause a script error. I've since added in automatic detection and a die to gracefully catch non-numeric filenames. Most of the files I'm listing end in numbers anyway, so it wasn't that big a problem for me. Eventually I'm planning to implement a switch (--activenumber="3" or something like that) that allows me to specify the number to increment / decrement (in case of filenames like "results-2000-01-25-final.html").
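The "die to gracefully catch non-numeric filenames" idea can be sketched roughly as below. This is not Ionizor's updated script, only a minimal illustration that assumes the same path/name/number/extension split as the original post; split_numbered_url is a hypothetical helper name and the example.com URL is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a URL into path, name, digits and extension, dying with a
# readable message instead of carrying on with undefined captures.
# split_numbered_url is a hypothetical helper name for this sketch.
sub split_numbered_url {
    my ($url) = @_;

    # Everything up to the last slash is the path; the rest is the filename.
    my ($path, $file) = $url =~ m{^(\S*/)([^/\s]+)$}
        or die "No filename found in '$url'\n";

    # Optional name, then a run of digits, then the extension.
    my ($name, $digits, $ext) = $file =~ /^(\S*?)(\d+)(\.\S+)$/
        or die "Filename '$file' does not end in a number\n";

    return ($path, $name, $digits, $ext);
}

# Example: print three URLs padded to the same width as the input.
my ($path, $name, $digits, $ext) =
    split_numbered_url('http://example.com/img/file01.txt');
printf "%s%s%0*d%s\n", $path, $name, length($digits), $_, $ext for 1 .. 3;
```

The list-context match followed by `or die` means a failed match stops the script with a message rather than a string of "uninitialized value" warnings.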

      One of my design goals with this was to make it as automagic as possible - I'm lazy so I want to be able to just copy and paste the URLs, add parameters to the end and hit enter.

      --
      Grant me the wisdom to shut my mouth when I don't know what I'm talking about.

        Note that my parameter order was designed to be done as lazily as possible too. Using your example, say I had the url http://foobar.com/results-2000-01-25-final.html and I wanted to get 2000-01-25 through 2000-01-28. My sequence of operations would go something like so.

        First I would type the command:

        wget `urlrange `

        Then paste the url:

        wget `urlrange http://foobar.com/results-2000-01-25-final.html`

        And finally all I have to do is move my cursor over to the 25, add a space before and after, and put the last digit of my range. My result would be something like:

        wget `urlrange http://foobar.com/results-2000-01- 25 28 -final.html`

        Which, I would say, is about as lazy as it can get (in terms of number of keystrokes). But... to each his own, I suppose. Your interface seems to work fine for you.
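For readers who want to try the prefix/first/last/suffix interface described above, here is a minimal sketch. It is not revdiablo's actual urlrange script, only a guess at the behaviour from the description; url_range and the optional width argument are hypothetical additions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the list of URLs for prefix/first/last/suffix arguments.
# $width is a hypothetical optional extra: when given, numbers are
# zero-padded to that many digits; otherwise they are printed as-is.
sub url_range {
    my ($prefix, $first, $last, $suffix, $width) = @_;
    $suffix = '' unless defined $suffix;
    my $format = defined $width ? "%0${width}d" : "%d";
    return map { $prefix . sprintf($format, $_) . $suffix }
           int($first) .. int($last);
}

# Only act as a command when arguments were supplied.
if (@ARGV) {
    die "Usage: urlrange <prefix> <first> <last> [suffix] [width]\n"
        unless @ARGV >= 3 and @ARGV <= 5;
    print "$_\n" for url_range(@ARGV);
}
```

With the arguments from the example above (http://foobar.com/results-2000-01- 25 28 -final.html) this yields four URLs, one per day from the 25th through the 28th. No regex is needed because the user marks the number by hand.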

Re: Downloading a range of sequential files
by Nkuvu (Priest) on Jul 25, 2003 at 03:23 UTC
(jeffa) Re: Downloading a range of sequential files
by jeffa (Bishop) on Jul 25, 2003 at 15:23 UTC
    I don't have time to explain this right now ... but here is a modification of your code that uses a plethora of CPAN modules. Hope this helps. :)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    

      Quite helpful indeed. Thank you!

      Getopt::Long and Pod::Usage seem like they would be quite useful.
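As a rough idea of what Getopt::Long and Pod::Usage bring to a script like this, here is a minimal sketch. It is not jeffa's code (which is not reproduced in this thread); the option names --min, --max and --width and the parse_args helper are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);
use Pod::Usage;

=head1 SYNOPSIS

listseq.pl [--min N] --max N [--width N] <url>

=cut

# Parse an argument list into a hash of settings. The option names
# are assumptions for this sketch, not an established interface.
sub parse_args {
    my @args = @_;
    my ($min, $max, $width, $help) = (1);

    GetOptionsFromArray(
        \@args,
        'min=i'   => \$min,
        'max=i'   => \$max,
        'width=i' => \$width,
        'help'    => \$help,
    ) or pod2usage(2);

    pod2usage(1) if $help;
    pod2usage("A URL and --max are required")
        unless @args == 1 and defined $max;

    return (url => $args[0], min => $min, max => $max, width => $width);
}

# Only act as a command when arguments were supplied.
if (@ARGV) {
    my %opt = parse_args(@ARGV);
    print "min=$opt{min} max=$opt{max} url=$opt{url}\n";
}
```

Getopt::Long handles the argument juggling that the original script did with shift, and pod2usage prints the SYNOPSIS section from the script's own POD when the arguments are wrong.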

      --
      Grant me the wisdom to shut my mouth when I don't know what I'm talking about.