Perl's built-in rename may be faster than File::Copy's move, but that's mostly because move contains logic to decide whether it can simply rename, or whether it must fall back to a copy and unlink (for example, when the source and destination are on different filesystems).
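As a rough sketch of the difference (the paths here are hypothetical, just for illustration):

#!/usr/bin/env perl
use strict;
use warnings;
use File::Copy qw(move);

# Hypothetical paths, for illustration only.
my $src = 'incoming/file.txt';
my $dst = 'sorted/17/file.txt';

# A plain rename is cheap, but only works within one filesystem.
# File::Copy's move() makes that decision for you, falling back to a
# copy and unlink when a simple rename isn't possible.
rename $src, $dst
    or move($src, $dst)
    or die "Could not move $src to $dst: $!";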
I was curious how quickly I could burn through 5,000 files while reading each one line by line, stopping the read as soon as I find the CHNL_ID line, and then renaming it within the same filesystem into a subdirectory based on the ID found. So I created a script that does just that.
It was interesting to me that after creating the files (which took some time), I was able to process 5,000 of them in under six seconds. My SSD is pretty fast, so your mileage will certainly vary. But I'm not seeing performance as a big problem, particularly since this only runs on a nightly basis. Here's the code:
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;

use File::Temp;
use File::Path qw(make_path);
use File::Spec::Functions qw(catdir catfile);
use Time::HiRes qw(tv_interval gettimeofday);
use Fcntl qw(:flock);

my $FILE_COUNT = 5_000;

# SETUP - Create $FILE_COUNT files that contain approximately 350k of data
# each, with the CHNL_ID line randomly distributed in each file.

say "Generating $FILE_COUNT temporary files.";

my @base_content = grep {!m/^\QA|CHNL_ID|\E\d+\n/} <DATA>;
@base_content = (@base_content) x 1024;

my $td = File::Temp->newdir(
    TEMPLATE => 'pm_tempXXXXX',
    TMPDIR   => 1,
    CLEANUP  => 1,
);

for my $n (0 .. 31) {
    make_path(catdir($td->dirname, sprintf("%02d", $n)));
}

for (1 .. $FILE_COUNT) {
    my $rand_ix = int(rand(scalar(@base_content)));
    my $chnl_id = sprintf "%02d", int(rand(32));
    my @output;
    for my $line_ix (0 .. $#base_content) {
        push @output, "A|CHNL_ID|$chnl_id\n" if $line_ix == $rand_ix;
        push @output, $base_content[$line_ix];
    }
    my $tf = File::Temp->new(
        TEMPLATE => 'pm_XXXXXXXXXXXX',
        SUFFIX   => '.txt',
        DIR      => $td->dirname,
        UNLINK   => 0,
    );
    print $tf @output;
    $tf->flush;
    close $tf;
}

# Sample file processor:

say "Processing of $FILE_COUNT files.";

my $t0 = [gettimeofday];

opendir my $dh, $td->dirname
    or die "Cannot open temporary directory (", $td->dirname, "): $!\n";

FILE: while (defined(my $dirent = readdir($dh))) {
    next if $dirent =~ m/^\.\.?$/;
    next unless $dirent =~ m/\.txt$/;
    my $path = catfile($td->dirname, $dirent);
    next unless -f $path;
    open my $fh, '<', $path or die "Cannot open $path for read: $!";
    flock $fh, LOCK_EX or die "Error obtaining a lock on $path: $!";
    while (defined(my $line = <$fh>)) {
        if ($line =~ m/^\QA|CHNL_ID|\E(\d+)$/m) {
            my $target_dir = catdir($td->dirname, $1);
            make_path($target_dir) unless -d $target_dir;
            my $dest = catfile($target_dir, $dirent);
            rename $path, $dest
                or die "Could not rename $path into $dest: $!";
            close $fh;
            next FILE;
        }
    }
    warn "Did not find CHNL_ID in $path. Skipping.\n";
    close $fh;
}

my $elapsed = tv_interval($t0);

say "Completed processing $FILE_COUNT files in $elapsed seconds.";

__DATA__
A|RCPNT_ID|92299999
A|RCPNT_TYP_CD|QL
A|ALERT_ID|264
A|FROM_ADDR_TX|14084007183
A|RQST_ID|PT201803989898
A|CRTEN_DT|02072018
A|CHNL_ID|17
A|RCPNT_FRST_NM|TESTSMSMIGRATION
A|SBJ_TX|Subject value from CDC
A|CLT_ID|14043
A|ALRT_NM|Order Shipped
A|CNTCT_ADDR|16166354429
A|RCPNT_LAST_NM|MEMBER
A|ORDR_NB|2650249999
A|LOB_CD|PBM
D|QL_922917566|20180313123311|1|TESTSMSMIGRATION MEMBER||
The output:
Generating 5000 temporary files.
Processing of 5000 files.
Completed processing 5000 files in 5.581455 seconds.
If you are using a slow spindle drive and a solution similar to this one really does require too much time, you may want to run it once per hour instead of nightly. That requires a little more effort: you need to ensure that only one instance runs at a time, and that you only pick up files that the program creating them has finished writing, but both of those concerns can be solved with a little thought and code.
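For the only-one-instance part, one common approach is an exclusive, non-blocking flock on a lock file taken at startup. A minimal sketch, with a lock-file path of my own choosing:

#!/usr/bin/env perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical lock-file location; pick whatever suits your system.
my $lockfile = '/var/tmp/chnl_id_sorter.lock';

open my $lock, '>', $lockfile or die "Cannot open $lockfile: $!";

# LOCK_NB makes flock return immediately instead of waiting, so a
# second invocation simply exits while the first is still running.
flock($lock, LOCK_EX | LOCK_NB) or exit 0;

# ... process the files here ...

# The lock is released when the filehandle is closed or the process exits.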
If you are dealing with 500,000 files instead of the 5,000 I sampled here, then on an equivalent system you should be able to process those 500,000 in about 558 seconds, or roughly 9 minutes, 18 seconds. You mentioned you are processing 80k files per hour, but on my system this script processes around 3,000,000 per hour, so about 37x more per hour than you have been experiencing. It's possible some of the improvement comes from not reading each file in its entirety, but given that I distribute the trigger line randomly throughout each file, that shouldn't account for more than a halving, on average, of the total run time. Possibly your move was doing a full copy, which would account for a lot more of the time.
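The extrapolation is just simple proportion from the timing in the output above; a quick sketch of the arithmetic:

#!/usr/bin/env perl
use strict;
use warnings;

# Figures taken from the benchmark output above.
my $files   = 5_000;
my $seconds = 5.581455;

my $rate = $files / $seconds;                               # ~896 files/second
printf "500,000 files: ~%.0f seconds\n", 500_000 / $rate;   # ~558 seconds
printf "Per hour:      ~%.0f files\n",   $rate * 3600;      # ~3.2 million files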
I'll suggest that if a method such as this one isn't fast enough, and running it more frequently isn't possible, you're going to have to do some profiling to determine where all the time is being spent.
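If it comes to that, and assuming Devel::NYTProf is available on your system, running the script under it is usually the quickest way to find the hot spots (yourscript.pl here is a placeholder for your own script name):

perl -d:NYTProf yourscript.pl
nytprofhtml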
Dave