perllearner007 has asked for the wisdom of the Perl Monks concerning the following question:

How can I remove duplicate entries of genes from an enormous gene list using Perl? My list is a text file containing the gene names.

Re: Duplicate entries?
by toolic (Bishop) on Jan 11, 2012 at 19:32 UTC
      This subroutine will sort the list of genes and return any duplicated entries, provided you have put them into an array where each gene (or the line containing the gene) is an element and every duplicated entry has the same format.
      sub findDupsInArray {
          my @array = sort { $a cmp $b } @_;
          my @dups;
          my $previtem;
          foreach my $item (@array) {
              if ( defined $previtem && $item eq $previtem ) {
                  push @dups, $item;    # current gene repeats the previous one
              }    # if
              $previtem = $item;
          }    # foreach
          return @dups;
      }
      Alternatively, you can make the whole list a hash where the gene name is the key and the gene description is the value, then export the hash, since a hash will not allow duplicated keys.
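      A minimal sketch of that hash approach, assuming a tab-separated input file called genes.txt with the gene name in the first column and an optional description in the second (both the filename and the column layout are assumptions, not something from the original post):

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Hypothetical input: genes.txt, tab-separated, gene name in column 1.
      my %genes;
      open my $fh, '<', 'genes.txt' or die "Cannot open genes.txt: $!";
      while ( my $line = <$fh> ) {
          chomp $line;
          my ( $name, $desc ) = split /\t/, $line, 2;
          $genes{$name} = $desc;    # a duplicate key simply overwrites the earlier one
      }
      close $fh;

      # Export the de-duplicated gene list, one name per line.
      print "$_\n" for sort keys %genes;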
Re: Duplicate entries?
by Marshall (Canon) on Jan 11, 2012 at 19:51 UTC
    Another way to remove duplicates is to just use the command-line sort. Command-line sort is not limited to having the entire file memory resident and can sort a HUGE file. Then cycle through that sorted file and don't output lines if the current line matches the immediately preceding line.
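
      A short sketch of that second pass in Perl, assuming the externally sorted output has been written to a file called sorted.txt (the filename is an assumption for the example):

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Read a pre-sorted file line by line and print a line only when it
      # differs from the one before it; duplicates are always adjacent
      # after sorting, so nothing needs to be held in memory.
      open my $fh, '<', 'sorted.txt' or die "Cannot open sorted.txt: $!";
      my $prev;
      while ( my $line = <$fh> ) {
          chomp $line;
          next if defined $prev && $line eq $prev;    # skip repeats
          print "$line\n";
          $prev = $line;
      }
      close $fh;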

      If on *nix, you can pipe the sort output into uniq (http://en.wikipedia.org/wiki/Uniq) to get rid of adjacent duplicates.

      knoppix@Microknoppix:~$ cat rubbish
      cat
      fish
      dog
      apple
      cat
      bird
      knoppix@Microknoppix:~$ sort rubbish | uniq
      apple
      bird
      cat
      dog
      fish
      knoppix@Microknoppix:~$

      I hope this is of interest.

      Cheers,

      JohnGG