
I don't know why this is taking so long.

Even on Windows XP, a directory with 10K or 20K files is no big deal. With the Windows NTFS file system there are very good reasons not to put a lot of files in the "root C:\" directory, but you aren't doing that. A sub-directory can have 50K files with no problem.

I would read the "target directory" and then code what I call an "execution plan". Moving files is a "destructive operation" because it modifies the input data. Copying files is not destructive, but takes longer.

Anyway, I would code the basic algorithm and leave the actual file moving or copying to a final step. I often define a constant like use constant ENABLE_MOVE => 0; and run the code to make sure that it is going to do what I want before I turn that constant "on".

Your code should take only a few seconds to decide what to do. Take the actual move or copy out of the equation until you have an efficient algorithm. Below I just print an intention of what would happen. Get that working efficiently, then "turn on" the actual file operation(s).

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dump qw(pp);
$|=1;  #turn off buffering to stdout for debugging

my %HoH;  #{extension}{name}

while (my $full_name = <DATA>)
{
    next if $full_name =~ /^\./; # skip names beginning with dot
    my ($name, $ext) = $full_name =~ /^(.+)\.(\w+)$/; # anchored, so names with spaces or hyphens stay whole
    next unless defined $ext;    # skip bare names without .extension
    
    $HoH{$ext}{$name}=1;
}

pp \%HoH;

foreach my $pdf_file (keys %{$HoH{pdf}})
{
    if (exists $HoH{epub}{$pdf_file})
    {
       print "do something with $pdf_file.pdf and $pdf_file.epub\n";
    }
}

=prints
{
  doc  => { baz => 1 },
  epub => { bar => 1, baz => 1, boo => 1 },
  pdf  => { baz => 1 },
  txt  => { "baz" => 1, "boo" => 1, "some.long.name" => 1 },
}
do something with baz.pdf and baz.epub
=cut

__DATA__
.
..
some.long.name.txt
baz.txt
baz.epub
baz.doc
baz.pdf
bar.epub
boo.epub
boo.txt
barefile
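To connect the demo above to a real directory, here is a minimal sketch of the "execution plan" idea with the ENABLE_MOVE switch. Treat it as an illustration under assumptions, not as the one right way: the helper names index_dir, make_plan and execute_plan are made up for this example, and so is the choice to move the .pdf of each pdf/epub pair into a "duplicates" sub-directory. Substitute whatever "do something" means for you.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;
use File::Copy     qw(move);
use File::Basename qw(dirname);
use File::Path     qw(make_path);

use constant ENABLE_MOVE => 0;   # 0 = dry run (print only), 1 = really move files

# Hypothetical helper: index plain files in $dir as {extension}{name}.
sub index_dir {
    my ($dir) = @_;
    my %HoH;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    while (defined(my $full_name = readdir $dh)) {
        next if $full_name =~ /^\./;                           # skip ., .. and dot files
        next unless -f File::Spec->catfile($dir, $full_name);  # plain files only
        my ($name, $ext) = $full_name =~ /^(.+)\.(\w+)\z/;
        next unless defined $ext;                              # skip names without .extension
        $HoH{$ext}{$name} = 1;
    }
    closedir $dh;
    return \%HoH;
}

# Hypothetical helper: build the plan as a list of [from, to] pairs.
# Assumption: when both name.pdf and name.epub exist, the .pdf is the one to move.
sub make_plan {
    my ($index, $src_dir, $dest_dir) = @_;
    my @plan;
    foreach my $name (sort keys %{ $index->{pdf} || {} }) {
        next unless exists $index->{epub}{$name};
        push @plan, [ File::Spec->catfile($src_dir,  "$name.pdf"),
                      File::Spec->catfile($dest_dir, "$name.pdf") ];
    }
    return @plan;
}

# Hypothetical helper: carry out the plan, or only print it when ENABLE_MOVE is 0.
sub execute_plan {
    my (@plan) = @_;
    foreach my $step (@plan) {
        my ($from, $to) = @$step;
        if (ENABLE_MOVE) {
            if (-e $to) {                       # never overwrite an existing file
                warn "skipping $from: $to already exists\n";
                next;
            }
            my $to_dir = dirname($to);
            make_path($to_dir) unless -d $to_dir;
            move($from, $to) or warn "move $from -> $to failed: $!\n";
        }
        else {
            print "would move $from -> $to\n";
        }
    }
    return;
}

# Usage: perl plan.pl SOURCE_DIR [DEST_DIR]
my ($src_dir, $dest_dir) = @ARGV;
if (defined $src_dir) {
    $dest_dir = File::Spec->catdir($src_dir, 'duplicates') unless defined $dest_dir;
    execute_plan( make_plan( index_dir($src_dir), $src_dir, $dest_dir ) );
}
```

With ENABLE_MOVE left at 0, this only reads the directory once and prints one "would move" line per pair, so it is safe to run repeatedly while you tune the matching logic. Only when the printed plan looks right would I flip the constant to 1.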

In reply to Re: Duplicates in Directories by Marshall
in thread Duplicates in Directories by kel
