Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have an array of filenames that looks something like this:
/data/node12/file-29-2.txt
/data/node12/file-34-2.txt
/data/node12/file-50-2.txt
/data/node30/file-34-2.txt
/data/node30/file-60-2.txt
/data/node30/file-62-2.txt
/data/node34/file-29-2.txt
etc. I want to remove duplicates from this array in the sense that files with the same -##- are identical, even if they are in different directories. So in the example above, I would want to eliminate /data/node30/file-34-2.txt and /data/node34/file-29-2.txt . I can think of ways to do this, but they are probably inefficient. Since the actual array contains ~10^6 filenames, it needs to be efficient. I believe there is an easy way to do this with hashes, but I can't remember it. Thanks!

Replies are listed 'Best First'.
Re: Removing duplicate filenames in different directories from an array
by wind (Priest) on Jan 30, 2011 at 20:29 UTC
    The code can probably be condensed, but this will get you what you want.
    use strict;
    use File::Basename qw(basename);

    my @files = qw(
        /data/node12/file-29-2.txt
        /data/node12/file-34-2.txt
        /data/node12/file-50-2.txt
        /data/node30/file-34-2.txt
        /data/node30/file-60-2.txt
        /data/node30/file-62-2.txt
        /data/node34/file-29-2.txt
    );

    my %seen = ();
    foreach my $file (@files) {
        my $bn = basename($file);
        if ($bn =~ /(\d+)/) {
            my $id = $1;
            if ($seen{$id}++) {
                print "$file needs to be deleted\n";
            }
        }
        else {
            warn "Unexpected file found: $file\n";
        }
    }
    - Miller
      ... condensed ...
      >perl -wMstrict -le
        "my @filenames = qw(
           /data/node12/file-29-2.txt
           /data/node12/file-34-2.txt
           /data/node12/file-50-2.txt
           /data/node30/file-34-2.txt
           /data/node30/file-60-2.txt
           /data/node30/file-62-2.txt
           /data/node34/file-29-2.txt
           );
        ;;
        my %seen;
        my @unique = grep { m{ (-\d\d-) }xms; !$seen{$1}++ } @filenames;
        print qq{'$_'} for @unique;
        "
      '/data/node12/file-29-2.txt'
      '/data/node12/file-34-2.txt'
      '/data/node12/file-50-2.txt'
      '/data/node30/file-60-2.txt'
      '/data/node30/file-62-2.txt'

      If the file names are in a file with one name per line, this could even be a one-liner.
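      For instance, a one-liner along those lines might look like this (a sketch, not tested against the real data; it assumes one name per line in a hypothetical filelist.txt, and that every name contains a -##- group):

      ```shell
      # Print only the first file seen for each -##- group.
      # filelist.txt is a hypothetical name for the input file.
      perl -ne 'print if /(-\d+-)/ && !$seen{$1}++' filelist.txt
      ```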

Re: Removing duplicate filenames in different directories from an array
by chrestomanci (Priest) on Jan 30, 2011 at 21:15 UTC

    You could use the -##- number as an index into an array of the file-names to keep. Perl will waste a bit of space on empty slots, but so long as the largest number is not huge it should be efficient.

    my @filenames;
    foreach my $file (@files) {
        if ( $file =~ m:/file-(\d+)-2\.txt$: ) {
            $filenames[$1] = $file;
        }
        else {
            warn "Unexpected file found: $file\n";
        }
    }
    foreach my $file (@filenames) {
        print "$file\n" if defined $file;
    }

    This method will silently discard duplicates, which may or may not be a problem for what you are trying to do, but it will be fast.
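    If which duplicate survives matters, a small variant of the same array trick keeps the first file seen for each number instead of letting later files overwrite earlier ones (a sketch using the defined-or assignment //=, which requires Perl 5.10+):

    ```perl
    use strict;
    use warnings;

    my @files = qw(
        /data/node12/file-34-2.txt
        /data/node30/file-34-2.txt
        /data/node30/file-60-2.txt
    );

    # Index by the -##- number; //= assigns only when the slot is still
    # undef, so the first file seen for each number wins.
    my @slot;
    for my $file (@files) {
        if ( $file =~ m{/file-(\d+)-2\.txt$} ) {
            $slot[$1] //= $file;
        }
    }
    my @unique = grep { defined } @slot;
    print "$_\n" for @unique;
    ```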