uvnew has asked for the wisdom of the Perl Monks concerning the following question:

Hi. I have a dataset of 100,000 lines of this sort (note the first and last lines in this example):

ENSP00000372533 ENSP00000372214
ENSP00000372533 ENSP00000362744
ENSP00000372525 ENSP00000368486
ENSP00000372521 ENSP00000355119
ENSP00000372521 ENSP00000362981
ENSP00000372214 ENSP00000372533

Every line such as:

ENSP1 ENSP4

will have, somewhere later in the set, the same pair in the opposite order:

ENSP4 ENSP1

(which I consider a redundancy). I want to make this set non-redundant. Could you please suggest a way of "cleaning" the dataset, i.e. getting rid of lines that already exist but in the opposite order? I couldn't think of a way to do it that is not immensely time-consuming and clumsy. Thanks a lot for any ideas!

Re: Creating a non-redundant set
by blokhead (Monsignor) on Jul 18, 2007 at 15:12 UTC
    If it's true that *every* line also has its partner occurring elsewhere in the file, then all you have to do is scan through the file and throw out lines where you have "ENSPxxx ENSPyyy" where xxx < yyy. This will leave only the lines where xxx > yyy, which is only one line for each pair of {xxx,yyy} that occurs in the file.
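A minimal sketch of that filter, under the assumptions stated in the paragraph above: every pair also appears reversed somewhere in the input, and the two IDs on a line are never identical. The helper name `keep_larger_first` is hypothetical. The IDs are compared as strings with `gt`, which agrees with numeric order only if all IDs have the same fixed-width format, as they do in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only lines whose first ID sorts after the second.
# Assumes every pair also occurs in reversed order elsewhere in the input,
# and that all IDs share the same fixed-width format.
sub keep_larger_first {
    my @lines = @_;
    my @kept;
    for my $line (@lines) {
        my ($first, $second) = split ' ', $line;
        next unless defined $second;    # skip blank or malformed lines
        push @kept, "$first $second" if $first gt $second;
    }
    return @kept;
}

# Illustrative sample: two pairs, each present in both orders.
my @sample = (
    'ENSP00000372533 ENSP00000372214',
    'ENSP00000368486 ENSP00000372525',
    'ENSP00000372214 ENSP00000372533',
    'ENSP00000372525 ENSP00000368486',
);
print "$_\n" for keep_larger_first(@sample);
```

For the real 100,000-line file the same test could be applied line by line while reading from a filehandle, so nothing has to be held in memory.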

    If some lines' partners do not appear in the file, then I doubt there's a simple way other than naively going through and keeping track of which lines you've seen. The above solution also may not be appropriate if you have some requirements about the order in which the lines appear (e.g., only the first occurrence should be preserved, not necessarily the one with xxx > yyy).
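The "keep track of which lines you've seen" approach might look roughly like the sketch below. It keeps the first occurrence of each unordered pair and preserves input order, so it does not rely on every line having a partner. The helper name `first_occurrences` is hypothetical, and the sketch assumes each valid line holds exactly two whitespace-separated IDs.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first line seen for each unordered pair.
sub first_occurrences {
    my @lines = @_;
    my (%seen, @kept);
    for my $line (@lines) {
        my @ids = split ' ', $line;
        next unless @ids == 2;            # skip blank or malformed lines
        my $key = join ' ', sort @ids;    # same key for "A B" and "B A"
        push @kept, "@ids" unless $seen{$key}++;
    }
    return @kept;
}

# The six sample lines from the question.
my @sample = (
    'ENSP00000372533 ENSP00000372214',
    'ENSP00000372533 ENSP00000362744',
    'ENSP00000372525 ENSP00000368486',
    'ENSP00000372521 ENSP00000355119',
    'ENSP00000372521 ENSP00000362981',
    'ENSP00000372214 ENSP00000372533',
);
print "$_\n" for first_occurrences(@sample);
```

On the question's sample this keeps the first five lines and drops the last one, which is the reverse of the first. The hash holds one key per distinct pair, which should be modest for 100,000 lines.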

    blokhead

Re: Creating a non-redundant set
by citromatik (Curate) on Jul 18, 2007 at 15:25 UTC

    Maybe you can store all the pairs sorted in a hash:

    use strict;
    use warnings;
    use Data::Dumper;

    my %data = ();
    while (<DATA>) {
        chomp;
        # Sort the two IDs so "A B" and "B A" produce the same hash key
        $data{ join(' ', sort { $a cmp $b } split / /, $_) } = 1;
    }
    print Dumper \%data;

    __DATA__
    ENSP000010 ENSP000011
    ENSP000020 ENSP000050
    ENSP000011 ENSP000010
    ENSP000050 ENSP000020

    Outputs:

    $VAR1 = {
              'ENSP000020 ENSP000050' => 1,
              'ENSP000010 ENSP000011' => 1
            };

    citromatik