in reply to read directory, fork processes

One addition I would make to ikegami's post: when doing your file discovery in the main thread, rename the files into a work directory before queuing the new name. If the target of the rename is on the same disk, the rename takes very little time regardless of the size of the file, and it keeps the arrivals directory clear of known files, which greatly simplifies the next pass of the discovery process.
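
A minimal sketch of that discovery step. The directory names are assumptions for illustration, and enqueue() is a stand-in for however the new names get handed to the worker processes:

#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

my $arrivals = '/data/arrivals';   # where new files land (assumed path)
my $workdir  = '/data/work';       # must be on the same filesystem

# Stand-in for the real queue handoff.
sub enqueue { my( $path ) = @_; print "queued: $path\n" }

opendir my $dh, $arrivals or die "opendir $arrivals: $!";
for my $name ( readdir $dh ) {
    my $from = File::Spec->catfile( $arrivals, $name );
    next unless -f $from;          # skip '.', '..' and subdirectories

    my $to = File::Spec->catfile( $workdir, $name );

    # On the same disk this is a cheap metadata operation, and it keeps
    # the arrivals directory holding only files we haven't seen yet.
    rename $from, $to or do { warn "rename $from: $!"; next };
    enqueue( $to );
}
closedir $dh;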

Also, if you can put the final (post-processing) destination (the backup location) on another drive, that will help with disk-head thrash. But don't rename the files there immediately: a rename across drives requires a full copy operation, which would slow things down again.
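
To make that concrete: rename() fails across filesystems (EXDEV), so a move to another drive degrades to copy-and-delete, which is the slow path you want to defer until after processing. A minimal sketch, with both paths assumed for illustration:

#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(move);

my $done   = '/data/work/report.csv';   # finished file on the work disk
my $backup = '/backup/report.csv';      # final home on another drive

# The cheap same-disk rename fails across devices; File::Copy::move
# then falls back to a full copy-and-unlink.
unless ( rename $done, $backup ) {
    move( $done, $backup ) or die "move $done: $!";
}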


Re^2: read directory, fork processes
by roboticus (Chancellor) on Feb 24, 2010 at 12:13 UTC

    We use a variation of the technique BrowserUk suggests, and it works quite well. Any process dropping something into the work queue directory initially creates the file with a ".work" extension. When the file is complete, it's renamed to remove the extension. And (of course) the processor ignores all files with that extension.

    ...roboticus

      roboticus, do you mind sharing your code?

        KevinBr:

        No, I don't mind. Here's a stripped-down version of the report processor:

#!/usr/bin/perl -w
use strict;
use warnings;
use myUtils;

###
# CONFIG
###
my %Apps = (
    CSV_Reports => {
        cmdline  => 'CSV_reports.pl',
        ext      => '.csv',
        inbound  => '/Rpts',
        outbound => '/Rpts/CSV',
    },
    Excel_Reports => {
        cmdline  => 'NME_report.pl',
        ext      => '.nme',
        inbound  => '/Rpts',
        outbound => '/Rpts/XL',
    },
);
my $DozeTime = 5; #5 * 60; # 5 min
my $DieFile  = '/var/log/report_processor.halt';

###
# CODE
###
open my $LOG, '>>', '/var/log/report_processor.log' or die $!;
print $LOG join("\n",
    box_fixed(80, 'Reports Log', 'Started at: ' . myUtils::timestamp()),
), "\n\n";

# Poll until the "die file" appears, then shut down cleanly.
while (! -e $DieFile) {
    for my $RName (keys %Apps) {
        chdir $Apps{$RName}{inbound};
        my @files = map  { s/\s+$//; $_ }             # and chomp the names
                    grep { /$Apps{$RName}{ext}$/ }    # files for this report
                    qx( ls -1 --file-type );          # check inbound dir for
        for my $FName (@files) {
            print $LOG myUtils::timestamp() . ": processing $RName: infile $FName\n";
            my $curtime = time;
            system($Apps{$RName}{cmdline}, $FName);
            $curtime = time - $curtime;
            print $LOG "\t$curtime secs.\n\n";
            print "mv $FName $Apps{$RName}{outbound}/$FName";   # console trace of the move
            rename($FName, "$Apps{$RName}{outbound}/$FName") or die $!;
        }
    }
    sleep $DozeTime;
}

if (-e $DieFile) {
    print $LOG "Normal halt at " . myUtils::timestamp() . "\n\n";
    unlink $DieFile;
}
else {
    print $LOG "*UNEXPECTED HALT: " . myUtils::timestamp() . "\n\n";
}
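
        As an aside (not part of roboticus's code): the qx( ls -1 --file-type ) shell-out could be replaced with Perl's built-in glob(). A minimal sketch, with the extension assumed for illustration:

use strict;
use warnings;

my $ext = '.csv';   # extension for one report type (assumed)

# glob() returns names with no trailing newline, so the chomping map is
# unnecessary, and a "*$ext" pattern can never match an in-progress
# "*.work" file.
my @files = grep { -f } glob "*$ext";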

        The basic shell of something that would generate the data file would look like this:

#!/usr/bin/perl -w
use strict;
use warnings;

my $FName = '/Rpts/geezer.csv';

# Build the report under a temporary ".work" name...
open my $OF, '>', "$FName.work" or die $!;
print $OF <<JUNK;
A bunch of junk to generate a report.
JUNK
close $OF;

# ...then rename it into place, so the processor only ever sees
# complete files.
rename "$FName.work", $FName or die $!;

        ...roboticus