Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I have to do the same thing to about 100,000 documents. Let's say 10,000 files in 10 directories, accessible via FTP. Get them, transform them, FTP them somewhere else.

I'm doing it like this, pseudocode:

for (@folders) {
    @list = list of files from folder on ftp server;
    for (@list) {
        get file;
        open, do stuff, close;
        ftp somewhere else;
    }
}
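
(For concreteness, a bare-bones Net::FTP version of that loop might look like this; the host, login details, transform and re-upload are just placeholders:)

use strict;
use warnings;
use Net::FTP;

my @folders = qw( dir01 dir02 );   # placeholder names for the ten directories

my $ftp = Net::FTP->new('ftp.example.com') or die "connect: $@";
$ftp->login('user', 'password')            or die "login: ", $ftp->message;

for my $folder (@folders) {
    $ftp->cwd($folder)   or die "cwd $folder: ", $ftp->message;
    my @list = $ftp->ls;
    for my $file (@list) {
        $ftp->get($file) or die "get $file: ", $ftp->message;   # this is where it craps out
        # ... open the local copy, do stuff, close ...
        # ... FTP the result somewhere else ...
        unlink $file;
    }
}
$ftp->quit;
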
But my problem is, FTP isn't that reliable. The connection has crapped out at the "get file" stage a couple of times.

So now all those "or die" tests the Monks have drilled into me don't seem suitable. It's going to take more than 24 hours to do this even if it all goes smoothly.

But when it goes non-smoothly, what would monks do?

TIA.



($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print

Replies are listed 'Best First'.
Re: Large FTP task
by roboticus (Chancellor) on May 29, 2006 at 13:17 UTC
    Cody Pendant:

    Short answer: I'd use your first suggestion. You might also want to look up Net::FTP::AutoReconnect. (I found it when looking up Net::FTP for my answer below.)

    Details: I had a similar problem a couple of years ago. I was using FTP to get files from our mainframe, and it would time out your connection if it detected two minutes of idle time. Unfortunately, sometimes the tape robot would take more than two minutes to get the file.

    So I created a subroutine that would accept a list of files to get. If anything failed, I'd detect it, close the connection and open a new one. I'm going from memory (and the Net::FTP page), but it went something like:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Net::FTP;

    # crappy globals, as noted below -- fill in host and credentials
    my $host;
    my $uid;
    my $pwd;
    my $ftph;

    # file name => the mainframe "directory" it lives in
    my %FList = (
        'BCLR7650(-1)' => 'BCL.SDBA.TRANSFER',
        'BCLR7651(-1)' => 'BCL.SDBA.TRANSFER',
        'BCXD5001(-1)' => 'BCL.SDBA.OUTGOING',
    );

    GetFiles(%FList);   # (the sub actually works off the global %FList)

    sub Connect {
        # assign to the global handle -- a lexical here would be thrown away
        $ftph = Net::FTP->new($host) or die "Can't connect";
        $ftph->login($uid, $pwd)     or die "Can't login!";
    }

    sub GetFiles {
        my $cntFails = 1;
        while ($cntFails > 0) {
            $cntFails = 0;
            Connect();
            for my $FName (keys %FList) {
                next if !defined $FList{$FName};   # already fetched on an earlier pass
                print "Getting $FName from $FList{$FName}\n";
                if ($ftph->cwd($FList{$FName})) {
                    if ($ftph->get($FName)) {
                        $FList{$FName} = undef;    # mark it done
                    }
                    else {
                        ++$cntFails;
                    }
                }
                else {
                    ++$cntFails;
                }
            }
            # drop the connection; if anything failed we reconnect and go again
            $ftph->quit;
        }
    }
    It went something like that. Yes, it used crappy global variables and such. (And the directory hierarchy on the mainframe *is* goofy, as it uses "." as directory separators. Yechh!) But at least it got the job done (unless, of course, you spelled a directory or file name wrong; it was single-minded in its determination, and would try all night long to get the file).

    --roboticus

      Hi Roboticus, I have a similar problem. I'm trying to FTP GET certain files from mainframes to zLinux. I was unable to retrieve files on tape, so I tried the following command: FTP->new($MVSADDR, Timeout=>1800, Debug=>1); and it seems to work. However, if I try to FTP GET files larger than 100 MB (located either on disk or tape), a zero-byte file is downloaded to Linux! I tried the above solution of using a while loop, but I still get a zero-byte file and the message "250 Transfer completed successfully." I think because of this 250 message the code exits out of the while loop. How do I get around this problem? Can anyone help? -Thanks, Regards, Gauri
        Gauri:

        I'm afraid I won't be much help there ... it just works on my system. I've retrieved files as large as 3GB without any problems other than finding the space to put the darned thing!

        But here are a few things I'd look at:

      • If you can read smaller files but not the 100MB ones, perhaps you should alter the timeout to see if that affects it.
      • Have you tried turning on debug mode in Net::FTP to see if it has any clues for you? (See the snippet after this list.)
      • A bit tedious--but you might consider monitoring the traffic with Ethereal.
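
        For the first two, something along these lines (untested; the host, credentials and timeout value are just examples) bumps the timeout and turns the protocol trace on:

        use strict;
        use warnings;
        use Net::FTP;

        # example values only -- set Timeout well past your slowest expected transfer
        my $ftp = Net::FTP->new('mvs.example.com', Timeout => 3600, Debug => 1)
            or die "connect: $@";
        $ftp->login('user', 'pass') or die "login: ",  $ftp->message;
        $ftp->binary                or die "binary: ", $ftp->message;   # make sure ASCII mode isn't mangling things
        $ftp->get('BIGFILE.DAT')    or die "get: ",    $ftp->message;
        $ftp->quit;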

        --roboticus

Re: Large FTP task
by BrowserUk (Patriarch) on May 29, 2006 at 15:07 UTC

    I'd use a short script to process each file individually.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my( $iUrl, $oUrl ) = @ARGV;

    # fetch the input file; wget handles retries and resuming for us
    system "wget -qc -O $$.in $iUrl" and exit 1;

    open I, '<', "$$.in"  or exit 2;
    open O, '>', "$$.out" or exit 3;
    while( <I> ) {
        my $modified = $_;    # ... do the transformation here ...
        print O $modified;
    }
    close I;
    unlink "$$.in";
    close O;

    # push the result to the destination
    system "wput -q $$.out $oUrl" and exit 4;
    unlink "$$.out";

    exit 0;

    I'd then have a second script that read the list of files and called the first script to process each file in turn. By checking the exit code from the processing script, it can keep track of the state of each file and perform remedial action as required. Once the process script is tested, if I had sufficient bandwidth and permission from the ftp servers, I might then think about using a small pool of threads or processes to achieve some concurrency.
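
    A rough sketch of that driver (assuming the processing script above is saved as process.pl and the file list holds one "input-url output-url" pair per line):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %failed;

    open my $list, '<', 'files.txt' or die "files.txt: $!";
    while( my $line = <$list> ) {
        chomp $line;
        my( $iUrl, $oUrl ) = split ' ', $line;

        system 'perl', 'process.pl', $iUrl, $oUrl;
        my $rc = $? >> 8;                 # exit code from the processing script

        $failed{ $iUrl } = $rc if $rc;    # 1=get, 2/3=open, 4=put
    }
    close $list;

    # anything recorded here gets retried (or reported) later
    print "$failed{$_}\t$_\n" for sort keys %failed;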

    By using wget and wput, which already know how to continue interrupted downloads, perform retries and even do bandwidth throttling, I avoid having to deal with that myself.

    By using a separate process to handle each file,

    • I avoid having to store all the files locally concurrently;
    • prevent one bad file from delaying all those that follow it;
    • simplify moving to concurrency if that is desirable;
    • isolate the command & control script from the communications processing.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Large FTP task
by salva (Canon) on May 29, 2006 at 11:16 UTC
    You can use an FTP client such as lftp, which is able to recover from almost any problem.
Re: Large FTP task
by tcf03 (Deacon) on May 29, 2006 at 14:19 UTC
    If this is a one-time operation, I'd use rsync or wget -t10 -o 'mylog.log' -c --glob=on --mirror ftp://somesite.com/dir/* or some variation. I'd also break up the operation into subs, i.e. get_list, get_file, transform_file, send_file; at each stage you can log errors and restart from there, pushing file names into @retry_get_file, @retry_transform, @retry_send_file, etc. (rough sketch below).
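
    Something like this skeleton (the four subs are stubs -- fill them in for your servers and transform; each should return true on success):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my (@retry_get_file, @retry_transform, @retry_send_file);

    for my $file ( get_list() ) {
        get_file($file)       or do { push @retry_get_file,  $file; next };
        transform_file($file) or do { push @retry_transform, $file; next };
        send_file($file)      or do { push @retry_send_file, $file; next };
    }

    # log whatever needs another pass
    print "retry get: @retry_get_file\n";
    print "retry transform: @retry_transform\n";
    print "retry send: @retry_send_file\n";

    sub get_list       { ... }   # fetch the remote directory listing
    sub get_file       { ... }   # FTP get, return true on success
    sub transform_file { ... }   # open, do stuff, close
    sub send_file      { ... }   # FTP put to the destination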

    Ted
    --
    "That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved."
      --Ralph Waldo Emerson
Re: Large FTP task
by Cody Pendant (Prior) on May 29, 2006 at 21:18 UTC
    Thanks everyone, lots to think about there. I think I'll start with Net::FTP::AutoReconnect, as that's been invented specifically for this. If it works, it means I don't have to worry about the flow and keeping a note of where I'm up to, although it might be slower than just powering through, skipping errors and cleaning up afterwards.



    ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
    =~y~b-v~a-z~s; print
Re: Large FTP task
by Anonymous Monk on Jun 02, 2006 at 23:52 UTC
    I've had the same problem the last week or so (must transfer 11k files in various directories), and have been using Net::FTP::AutoReconnect. The server I'm getting files from is extremely flaky. Net::FTP was painful and N:F:A is slightly less so, but it is still very unreliable. It doesn't handle all errors correctly (I end up with benign "Binary mode" responses in the fatal error messages) and bails out completely on other problems. The module looks pretty new at version 0.1, so it's understandable. I'm just now looking into trapping all errors and restarting everything, or using a more industrial-strength external program (as was suggested) to fetch the files that the script finds. Surprising there isn't something more comprehensive out there after all these years, no?
      PS, the first thing you should do is:
      1. make a local copy of ::AutoReconnect
      2. change all the die statements to return 0
      3. add a sleep(x); $ftp->reconnect(); in place of your usual FTP error handling (roughly as sketched below)
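
      In other words, wrap each transfer in something like this (a rough sketch; the retry count and sleep time are arbitrary, and reconnect() is the method mentioned in step 3):

      sub try_or_reconnect {
          my ($ftp, $method, @args) = @_;
          for my $attempt (1 .. 5) {
              my $result = $ftp->$method(@args);
              return $result if defined $result;   # the patched module returns undef on failure
              warn "$method failed (attempt $attempt), reconnecting...\n";
              sleep 60;
              $ftp->reconnect();
          }
          return undef;                            # still broken after five tries
      }

      # e.g.  my $ok = try_or_reconnect($ftp, 'get', $remote, $local);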
        s/return 0/return undef/