Re: Duplicates in Directories

I don't know why this is taking so long.

Even on Windows XP, a directory with 10K or 20K files is no big deal. With the Windows NTFS file system there are very good reasons not to put a lot of files in the "root C:\" directory, but you aren't doing that. A sub-directory can have 50K files with no problem.

I would read the "target directory" and then code what I call an "execution plan". Moving files is a "destructive operation" because it modifies the input data. Copying files is not destructive, but takes longer.
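As a rough sketch of what I mean by an "execution plan": read the directory once, build a list of intended actions in memory, and touch nothing on disk. The helper name build_plan and the 'duplicates' destination are made up for illustration; adapt them to your own layout.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: read a directory once and return a list of
# planned actions. Nothing on disk is modified here.
sub build_plan {
    my ($target_dir, $dest_dir) = @_;

    opendir(my $dh, $target_dir) or die "Cannot open $target_dir: $!";
    my @names = sort grep { !/^\./ } readdir $dh;   # skip ., .. and dot files
    closedir $dh;

    my @plan;
    foreach my $name (@names) {
        my $from = File::Spec->catfile($target_dir, $name);
        next unless -f $from;                        # plain files only
        push @plan, {
            from => $from,
            to   => File::Spec->catfile($dest_dir, $name),
        };
    }
    return @plan;
}

# Print the plan for the current directory; 'duplicates' is an assumed destination.
print "would move $_->{from} -> $_->{to}\n" for build_plan('.', 'duplicates');
```

Because the plan is just data, you can print it, count it, or filter it before any file is moved or copied.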

Anyway, I would code the basic algorithm and leave the actual file moving or copying to a final step. I often define a constant such as "use constant ENABLE_MOVE => 0;" and run the code to make sure that it is going to do what I want before I turn that constant "on".
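Something along these lines is how I would gate the destructive step behind that constant. This is only a sketch: move_or_report and the paths are made-up names, and File::Copy's move is just one way to do the real operation.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(move);

use constant ENABLE_MOVE => 0;   # set to 1 only after the dry run looks right

# Hypothetical helper: either performs the move or just reports it.
# Returns the message it printed so the behaviour is easy to check.
sub move_or_report {
    my ($from, $to) = @_;
    my $msg;
    if (ENABLE_MOVE) {
        move($from, $to) or die "move $from -> $to failed: $!";
        $msg = "moved $from -> $to";
    }
    else {
        $msg = "would move $from -> $to";
    }
    print "$msg\n";
    return $msg;
}

move_or_report('baz.pdf', 'duplicates/baz.pdf');   # made-up paths
```

With ENABLE_MOVE left at 0 the script only reports what it would do, so you can run it against the real directory as often as you like.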

Your code should just take some seconds to decide what to do. Take the actual move or copy out of the equation until you have an efficient algorithm. Below I just print an intention of what would happen. Get that working efficiently then "turn on" the actual file operation(s).

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dump qw(pp);
$|=1;  #turn off buffering to stdout for debugging

my %HoH;  #{extension}{name}

while (my $full_name = <DATA>)
{
    next if $full_name =~ /^\./; # skip names beginning with dot
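    # Split on the last dot. Anchored at the start so that names containing
    # spaces or hyphens are kept whole instead of being cut at the odd character.
    my ($name, $ext) = $full_name =~ /^(.+)\.(\w+)$/;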
    next unless defined $ext;    # skip bare names without .extension
    
    $HoH{$ext}{$name}=1;
}

pp \%HoH;

foreach my $pdf_file (keys %{$HoH{pdf}})
{
    if (exists $HoH{epub}{$pdf_file})
    {
       print "do something with $pdf_file.pdf and $pdf_file.epub\n";
    }
}

=prints
{
  doc  => { baz => 1 },
  epub => { bar => 1, baz => 1, boo => 1 },
  pdf  => { baz => 1 },
  txt  => { "baz" => 1, "boo" => 1, "some.long.name" => 1 },
}
do something with baz.pdf and baz.epub
=cut

__DATA__
.
..
some.long.name.txt
baz.txt
baz.epub
baz.doc
baz.pdf
bar.epub
boo.epub
boo.txt
barefile