I need to traverse a remote directory tree over HTTP, parse the directory listing that comes back, and pick the latest or second-to-latest directory shown.

Once I have that, I need to fetch some files within that directory by name (each filename includes the date).

For example, I will see something like this:

   Parent Directory/                      -    Directory
   20060922/         2006-Nov-13 01:11:31 -    Directory
   20060927/         2006-Nov-13 01:16:45 -    Directory
   20061016/         2006-Dec-25 03:16:32 -    Directory
   20061103/         2006-Dec-25 03:18:05 -    Directory
   20061202/         2007-Jan-30 18:07:53 -    Directory
   20061224/         2007-Feb-13 23:23:44 -    Directory
   20070126/         2007-Mar-11 19:16:45 -    Directory
   20070208/         2007-Feb-09 03:04:34 -    Directory
   20070225/         2007-Feb-25 23:44:05 -    Directory

From here, I can see that I want either

20070225
or
20070208
as the latest and second-to-latest directories in the tree (going by the directory name, not by the modification timestamp).
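A minimal sketch of that selection step, assuming the listing arrives as plain text in the layout shown above (a real index page is usually HTML, so the pattern would need to match the link markup instead). The `dated_dirs` name is hypothetical; the sample data is taken from the listing above.

```perl
use strict;
use warnings;

# Return the names of the dated (YYYYMMDD) directories found in a listing,
# newest first. Assumes one entry per line, as in the sample above.
sub dated_dirs {
    my ($listing) = @_;
    my @dirs = $listing =~ m{^\s*(\d{8})/}mg;
    # Fixed-width YYYYMMDD names sort chronologically as plain strings.
    return sort { $b cmp $a } @dirs;
}

my $listing = <<'END';
   Parent Directory/                      -    Directory
   20061224/         2007-Feb-13 23:23:44 -    Directory
   20070126/         2007-Mar-11 19:16:45 -    Directory
   20070208/         2007-Feb-09 03:04:34 -    Directory
   20070225/         2007-Feb-25 23:44:05 -    Directory
END

my ($latest, $previous) = dated_dirs($listing);
print "latest: $latest, second-to-latest: $previous\n";
```

Sorting on the name rather than the timestamp column matters here: 20070126 carries a later modification time than 20070225, yet it is not one of the two directories wanted.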

Once I know this, I need to traverse into one of those directories and fetch a series of files that have the date in the filename. These files are enormous (tens of gigabytes in size).

What is the best approach to this problem, keeping in mind that everything happens remotely over HTTP and that the ability to resume aborted fetches (à la wget -c) is critical?
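For reference, resuming over HTTP generally comes down to sending a Range header that starts at the size of the partial local file. The sketch below uses the core HTTP::Tiny module; the `resume_headers` and `fetch_resumable` names are hypothetical, and it assumes the server supports byte ranges (answering 206 Partial Content). It is an illustration under those assumptions, not a tested replacement for wget -c.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Build the request headers needed to resume a download into $path.
# Returns an empty hash ref when there is nothing on disk yet.
sub resume_headers {
    my ($path) = @_;
    my $have = -s $path;    # undef if the file is missing, 0 if empty
    return $have ? { Range => "bytes=$have-" } : {};
}

# Fetch $url into $path, continuing from any partial file already there.
sub fetch_resumable {
    my ($url, $path) = @_;
    my $fh;
    my $res = HTTP::Tiny->new->get($url, {
        headers       => resume_headers($path),
        data_callback => sub {
            my ($chunk, $response) = @_;
            unless ($fh) {
                # 206 means the server honoured the Range: append.
                # A plain 200 means it sent the whole file: start over.
                my $mode = $response->{status} == 206 ? '>>' : '>';
                open $fh, $mode, $path or die "open $path: $!";
                binmode $fh;
            }
            print {$fh} $chunk or die "write $path: $!";
        },
    });
    close $fh if $fh;
    return $res;    # caller inspects $res->{success} and $res->{status}
}
```

Streaming through the callback keeps memory use flat, which matters for files of this size. A 416 response most likely means the local file is already complete, so the caller would want to treat that case separately from a real failure.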

Here is the order of events:

  1. Connect to the directory resource and fetch the HTML page that lists the available directories.
  2. Parse the list, sort it, and retrieve the two most recent directories.
  3. Traverse into one or the other, starting with the second-to-latest, and fetch file-$DATE-001.dat .. n, resuming any previously aborted fetches where required.
  4. Store the files locally, verify that each transfer is complete, and delete any local copies of older directories that remain (thus keeping a "mirror" of only the latest two remote copies).
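The cleanup in step 4 can be sketched as a pure selection followed by a delete. This assumes the local mirror uses the same YYYYMMDD directory names as the remote side; `stale_dirs` and `$mirror_root` are hypothetical names, and the actual removal is left commented out.

```perl
use strict;
use warnings;
use File::Path qw(remove_tree);

# Given the local directory names and the names to keep, return the
# dated (YYYYMMDD) directories that are no longer wanted. Anything that
# does not look like a dated directory is left alone.
sub stale_dirs {
    my ($local, $keep) = @_;
    my %keep = map { $_ => 1 } @$keep;
    return grep { /^\d{8}$/ && !$keep{$_} } @$local;
}

my @local = qw(20061224 20070126 20070208 20070225 notes);
my @stale = stale_dirs(\@local, [qw(20070225 20070208)]);
print "would remove: @stale\n";

# Once the new transfers are verified, the stale directories could go:
# remove_tree(map { "$mirror_root/$_" } @stale);
```

One way to verify a transfer before pruning is to compare the local file size with the Content-Length the server reports for the same file, though that only catches truncation, not corruption.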

Which modules should I be exploring, other than the obvious LWP, WWW::Robot, File::Path, Date::Calc, Date::Manip and such?

Are there any canned routines or snippets somewhere that can help? Or in the absence of that, a tutorial that goes through some of this?


In reply to Traversing directories to get the "most-recent" or "second-to-most-recent" directory contents by hacker
