Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have some data that I want to sort. It is in a table format where the first column is an id and the other columns contain information. Unfortunately, the ids are not always unique (there may be more than one row per id). I want to simply get rid of all redundant rows and keep only the first row for each id (the row nearest the top).

This is what the data looks like; the ids in this case are 1374, 1374, and 1450:

1374:1-202 gb|AE000516.2| Mycobacterium tuberculosis CDC1551, complete genome 34.5 69 3.6 202 4403837 14/48 29% 25/48 52% 38 181 1895058 1895201
1374:1-202 gb|AE000516.2| Mycobacterium tuberculosis CDC1551, complete genome 34.1 68 5.0 2
1450:1-202 emb|BX248345.1| Mycobacterium bovis subsp. bovis AF2122/97 complete genome; segment 12/14 70.3 147 6e-11 202 308050 28/59 47% 43/59 72% 17 193 168681 168505
In this case I would want to get rid of the second row and keep only the 1st and 3rd rows.

I thought that this code would work, but instead it is printing all rows, not just one row per id. Can anyone please help out?

Here is my code:

for (my $i=0; $i<@parsed_file; $i++) {
    my @record = $parsed_file[$i];
    my $record = join ('', @record);
    @record = split (/\t/, $record);
    $num = $freq{$record[0]}{"freq"}++;
    $freq{$array[0]}{"value"}[$num] = $_;
    my @id;
    push (@id, $record[0]);
}

# i sort based on id to extract unique id's
my @sorted_array = sort {$freq{$b}{"freq"} <=> $freq{$a}{"freq"}} keys %freq;
##print "$sorted_array[0]\n";

for (my $i=0; $i<@parsed_file; $i++) {
    my @hit = $parsed_file[$i];
    my $hit = join ('', @record);
    @hit = split (/\t/, $record);
    my $c=0;
    my $id2 = $hit[0];

    foreach my $id (@sorted_array) {
        if ($id == $id2) {
            ++$c;
        }
        # try to match unique id's to the file and print the first instance found, but it prints everything
        if ($c == 1) {
            print "$parsed_file[$i]\n";
        }
    }
}

Re: sorting arrays
by gam3 (Curate) on Apr 12, 2005 at 17:41 UTC
    my %unique;
    my @unique = grep { !$unique{$_->[0]}++ } @parsed_file;
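    That assumes each element of @parsed_file is an array reference whose first element is the id. If @parsed_file holds plain tab-separated lines instead, the same grep idea works by splitting out the first field (a sketch, assuming tab-delimited rows):

    my %unique;
    my @unique = grep { !$unique{ (split /\t/, $_)[0] }++ } @parsed_file;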
    -- gam3
    A picture is worth a thousand words, but takes 200K.
Re: sorting arrays
by sasikumar (Monk) on Apr 12, 2005 at 17:41 UTC
    Hi

    Use a hash with column 1 as your key. Before adding a row to the hash, check whether the key already exists; if it does, do not overwrite it, just ignore the row.
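    A minimal sketch of that approach, assuming @parsed_file holds tab-separated lines with the id in the first column:

    my %seen;
    my @kept;
    for my $row (@parsed_file) {
        my ($id) = split /\t/, $row;                  # column 1 is the id
        push @kept, $row unless $seen{$id}++;         # keep only the first row per id
    }
    print "$_\n" for @kept;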

    Thanks
    SasiKumar
Re: sorting arrays
by gaal (Parson) on Apr 12, 2005 at 17:45 UTC
    If the data fits easily in memory, go over each row, extract the id from it, and insert ($id => $original_line) into a hash if the id is new. Then print the hash values, sorted by key.

    my %seen;
    for my $row (@data) {
        my ($id) = $row =~ /^(\d+):/
            or die "bad line: [$row]";
        $seen{$id} ||= $row;
    }
    print $seen{$_}, "\n" for sort { $a <=> $b } keys %seen;
Re: sorting arrays
by tlm (Prior) on Apr 12, 2005 at 18:47 UTC

    On Unix a simple alternative to perl is to use the system's sort command:

    % sort -uk 1,1 datafile > sorted_datafile
    That will sort strictly (i.e. no duplicates) on the first field; the first instance encountered is the one kept. See man sort (or info sort if your system uses GNU's sort).
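    If the columns are tab-separated, you may want to set the field separator explicitly so the key is exactly the first column (an assumption about GNU sort invoked from bash or zsh):

    % sort -t $'\t' -uk 1,1 datafile > sorted_datafile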

    the lowliest monk

      A hash slice would work too.
      my %hash;
      @hash{@array} = @array;
      my @sorted = sort keys %hash;