Extracting unique lines

anasuya has asked for the wisdom of the Perl Monks concerning the following question:

Hi. I have a file which looks like this. It has two fields which are separated by a '+' sign.

d_145_1_2- + c_3_1_8-e_74_1_1-
a_100_1_6-c_2_1_6- + b_50_1_2-
c_69_1_17- + b_61_6_1-
c_2_1_2- + a_123_1_1-
d_83_1_1- + c_2_1_5-d_162_1_1-
c_2_1_2- + a_123_1_1-
a_123_1_1- + c_2_1_2-

What I need to do is extract the lines that are unique in this file. For example, from the snippet above, the following lines are unique:

d_145_1_2- + c_3_1_8-e_74_1_1-
a_100_1_6-c_2_1_6- + b_50_1_2-
c_69_1_17- + b_61_6_1-
c_2_1_2- + a_123_1_1-
d_83_1_1- + c_2_1_5-d_162_1_1-

Note that the fields a_123_1_1- and c_2_1_2- occur as a pair more than once, but with their relative order reversed. Is there any way I can extract the unique lines, keeping only one occurrence of such pairs (i.e. a_123_1_1- and c_2_1_2-)? So far I have tried awk. I was unable to retrieve the unique lines using uniq, as it doesn't treat the same combination of fields in reverse order as a duplicate. I also tried merging the two fields together and then running awk operations on the result, but to no avail. Is there any way Perl can make the job easier?


Replies are listed 'Best First'.
Re: xtracting unique lines by Happy-the-monk (Canon) on Mar 27, 2012 at 18:22 UTC
I'd split each line on the " `+` " string and `sort {$a cmp $b}` the pair into a temporary array. Join that array with the " `+` " string and use the result as the hash key, with the original string as the hash value. When done, print out all the values. To shorten it, do the split, sort and join in one go and you get rid of the temporary array. Cheers, Sören
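The split, sort and join steps described above might look roughly like the following. This is only a sketch: the helper names `pair_key` and `unique_pairs` are made up for illustration, and the separator pattern assumes the fields themselves never contain a '+'. It also keeps an `@order` array, which the description above does not mention, because a plain hash would not return its values in input order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build an order-independent key for one line
# by splitting on '+', sorting the fields and joining them again.
sub pair_key {
    my ($line) = @_;
    return join ' + ', sort { $a cmp $b } split /\s*\+\s*/, $line;
}

# Hypothetical helper: keep the first original line seen for each key,
# returned in input order.
sub unique_pairs {
    my @lines = @_;
    my ( %first, @order );
    for my $line (@lines) {
        my $key = pair_key($line);
        next if exists $first{$key};    # this pair was already seen
        $first{$key} = $line;           # original string is the hash value
        push @order, $key;
    }
    return map { $first{$_} } @order;
}

my @lines = (
    'c_2_1_2- + a_123_1_1-',
    'a_123_1_1- + c_2_1_2-',
    'c_69_1_17- + b_61_6_1-',
);
print "$_\n" for unique_pairs(@lines);
```

With the three sample lines, the reversed pair collapses into its first occurrence, so two lines are printed.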
Re: xtracting unique lines by nemesdani (Friar) on Mar 27, 2012 at 18:19 UTC
Read the file. Make a hash. Split the line. Check each part, if it exists in the hash. If not, fill the fields into the hash. Have fun while doing it!
Re: xtracting unique lines by Cristoforo (Curate) on Mar 28, 2012 at 02:03 UTC
Using grep, you can filter out duplicate fields by testing to see if they have been seen yet.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;

{
    local $\ = "\n"; # call to print() ends in newline
    while (<DATA>) {
        chomp;
        print unless grep $seen{$_}++, split /\s+\+\s+/;
    }
}
```

Chris

Update: Misread the question, missed that they can occur reversed. This should produce the results.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;

{
    local $\ = "\n"; # call to print() ends in newline
    while (<DATA>) {
        chomp;
        my $sorted = join "", sort split /\s\+\s/;
        print unless $seen{$sorted}++;
    }
}

__DATA__
d_145_1_2- + c_3_1_8-e_74_1_1-
a_100_1_6-c_2_1_6- + b_50_1_2-
c_69_1_17- + b_61_6_1-
c_2_1_2- + a_123_1_1-
d_83_1_1- + c_2_1_5-d_162_1_1-
c_2_1_2- + a_123_1_1-
a_123_1_1- + c_2_1_2-
```
Re^2: xtracting unique lines by anasuya (Novice) on Mar 28, 2012 at 11:07 UTC
Hi. I tried out what you said above. It worked, thanks. What I need to do next is count the occurrences of each of these lines. As you can see in <DATA>, the string "c_2_1_2- + a_123_1_1-" occurs 2 times and its reverse, "a_123_1_1- + c_2_1_2-", occurs once. I need a cumulative count for this pair, irrespective of the order in which it occurs (i.e. as "a_123_1_1- + c_2_1_2-" or as "c_2_1_2- + a_123_1_1-"), so that the total count for this entry is 3, as in <DATA>. The actual file I am working on is similar but larger, with around 8000 lines. What is the solution to this problem? awk hasn't helped me so far.
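One possible way to get such a cumulative count is to reuse the sorted-key idea and increment a counter per key instead of only testing it. The following is a minimal sketch, not a reply from the thread: the sub name `count_pairs` and the output layout are assumptions made for illustration, and each pair is reported in its sorted form rather than as it first appeared.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count lines per unordered pair.
# Returns a hash ref of counts and an array ref of keys in first-seen order.
sub count_pairs {
    my @lines = @_;
    my ( %count, @order );
    for my $line (@lines) {
        # Sorting the two fields makes "x + y" and "y + x" share one key.
        my $key = join ' + ', sort { $a cmp $b } split /\s*\+\s*/, $line;
        push @order, $key unless $count{$key}++;
    }
    return ( \%count, \@order );
}

my @lines = (
    'd_145_1_2- + c_3_1_8-e_74_1_1-',
    'a_100_1_6-c_2_1_6- + b_50_1_2-',
    'c_69_1_17- + b_61_6_1-',
    'c_2_1_2- + a_123_1_1-',
    'd_83_1_1- + c_2_1_5-d_162_1_1-',
    'c_2_1_2- + a_123_1_1-',
    'a_123_1_1- + c_2_1_2-',
);

my ( $count, $order ) = count_pairs(@lines);
printf "%-40s %d\n", $_, $count->{$_} for @$order;
```

For the sample data, the key "a_123_1_1- + c_2_1_2-" ends up with a count of 3 and every other pair with 1. A single pass with one hash entry per distinct pair should be comfortable for a file of around 8000 lines.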
Re: xtracting unique lines by johngg (Canon) on Mar 28, 2012 at 07:29 UTC
This is similar to Cristoforo's solution but using a sort of Schwartzian Transform to sort the keys for the `%seen` hash.

```
knoppix@Microknoppix:~$ perl -E '
> open my $inFH, q{<}, \ <<EOD or die qq{open: <<HEREDOC: $!\n};
> d_145_1_2- + c_3_1_8-e_74_1_1-
> a_100_1_6-c_2_1_6- + b_50_1_2-
> c_69_1_17- + b_61_6_1-
> c_2_1_2- + a_123_1_1-
> d_83_1_1- + c_2_1_5-d_162_1_1-
> c_2_1_2- + a_123_1_1-
> a_123_1_1- + c_2_1_2-
> EOD
>
> my %seen;
> print
>     map  { qq{$_->[ 0 ]\n} }
>     grep { ! $seen{ $_->[ 1 ] } ++ }
>     map  { chomp; [ $_, join( q{:}, sort split m{ \+ }, $_ ) ] }
>     <$inFH>;'
d_145_1_2- + c_3_1_8-e_74_1_1-
a_100_1_6-c_2_1_6- + b_50_1_2-
c_69_1_17- + b_61_6_1-
c_2_1_2- + a_123_1_1-
d_83_1_1- + c_2_1_5-d_162_1_1-
knoppix@Microknoppix:~$
```

I hope this is of interest. Cheers, JohnGG