Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have an array of filenames that looks something like this:
/data/node12/file-29-2.txt
/data/node12/file-34-2.txt
/data/node12/file-50-2.txt
/data/node30/file-34-2.txt
/data/node30/file-60-2.txt
/data/node30/file-62-2.txt
/data/node34/file-29-2.txt
etc. I want to remove duplicates from this array in the sense that files with the same -##- are identical, even if they are in different directories. So in the example above, I would want to eliminate /data/node30/file-34-2.txt and /data/node34/file-29-2.txt . I can think of ways to do this, but they are probably inefficient. Since the actual array contains ~10^6 filenames, it needs to be efficient. I believe there is an easy way to do this with hashes, but I can't remember it. Thanks!

Replies are listed 'Best First'.
Re: Removing duplicate filenames in different directories from an array
by wind (Priest) on Jan 30, 2011 at 20:29 UTC
    The code can probably be condensed, but this will get you what you want.
    use strict;
    use File::Basename qw(basename);

    my @files = qw(
        /data/node12/file-29-2.txt
        /data/node12/file-34-2.txt
        /data/node12/file-50-2.txt
        /data/node30/file-34-2.txt
        /data/node30/file-60-2.txt
        /data/node30/file-62-2.txt
        /data/node34/file-29-2.txt
    );

    my %seen = ();
    foreach my $file (@files) {
        my $bn = basename($file);
        if ($bn =~ /(\d+)/) {
            my $id = $1;
            if ($seen{$id}++) {
                print "$file needs to be deleted\n";
            }
        }
        else {
            warn "Unexpected file found: $file\n";
        }
    }
    - Miller
      ... condensed ...
      >perl -wMstrict -le
        "my @filenames = qw(
           /data/node12/file-29-2.txt
           /data/node12/file-34-2.txt
           /data/node12/file-50-2.txt
           /data/node30/file-34-2.txt
           /data/node30/file-60-2.txt
           /data/node30/file-62-2.txt
           /data/node34/file-29-2.txt
           );
        ;;
        my %seen;
        my @unique = grep { m{ (-\d\d-) }xms; !$seen{$1}++ } @filenames;
        print qq{'$_'} for @unique;
        "
      '/data/node12/file-29-2.txt'
      '/data/node12/file-34-2.txt'
      '/data/node12/file-50-2.txt'
      '/data/node30/file-60-2.txt'
      '/data/node30/file-62-2.txt'

      If the file names are in a file with one name per line, this could even be a one-liner.
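      For instance, a one-liner along those lines might look like this (a sketch, not tested against the real data; it assumes one name per line in a hypothetical filelist.txt, and that every name contains a -##- group):

      ```shell
      # Print only the first file seen for each -##- group.
      # filelist.txt is a hypothetical name for the input file.
      perl -ne 'print if /(-\d+-)/ && !$seen{$1}++' filelist.txt
      ```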

Re: Removing duplicate filenames in different directories from an array
by chrestomanci (Priest) on Jan 30, 2011 at 21:15 UTC

    You could use the -##- number as an index into an array of the file-names to keep. Perl will waste a bit of space on empty slots, but so long as the largest number is not huge it should be efficient.

    my @filenames;
    foreach my $file (@files) {
        if ( $file =~ m:/file-(\d+)-2\.txt$: ) {
            $filenames[$1] = $file;
        }
        else {
            warn "Unexpected file found: $file\n";
        }
    }
    foreach my $file (@filenames) {
        print "$file\n" if defined $file;
    }

    This method will silently discard duplicates, which may or may not be a problem for what you are trying to do, but it will be fast.
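    If which duplicate survives matters, a small variant of the same array trick keeps the first file seen for each number instead of letting later files overwrite earlier ones (a sketch using the defined-or assignment //=, which requires Perl 5.10+):

    ```perl
    use strict;
    use warnings;

    my @files = qw(
        /data/node12/file-34-2.txt
        /data/node30/file-34-2.txt
        /data/node30/file-60-2.txt
    );

    # Index by the -##- number; //= assigns only when the slot is still
    # undef, so the first file seen for each number wins.
    my @slot;
    for my $file (@files) {
        if ( $file =~ m{/file-(\d+)-2\.txt$} ) {
            $slot[$1] //= $file;
        }
    }
    my @unique = grep { defined } @slot;
    print "$_\n" for @unique;
    ```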