Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a large file which contains lines similar to
at12
at13
bn23
bn23
and it goes on and on. Somewhere in the file most of the lines are duplicated, but I am interested in the entries that do not have a duplicate. I have tried using the Unix uniq utility with no luck. Is it possible to use Perl to get around this? Thanks

Replies are listed 'Best First'.
Re: removing non-duplicates
by davidrw (Prior) on Jul 11, 2005 at 19:36 UTC
    Why didn't unix uniq work? I believe that uniq -u file.txt is what you're looking for. Note that the file needs to be sorted first:
    sort file.txt | uniq -u > unique_lines.txt
      Although sort file | uniq works, why not just use...
      sort -u

      One world, one people

        Because that doesn't do what the OP wanted. Quotes from the respective man pages:
        • sort: -u outputs only the first of an equal run (i.e. one copy of every distinct line)
        • uniq: -u only prints unique lines (i.e. only the lines that appear exactly once)
        [me@host tmp]$ cat /tmp/t
        A
        B
        A
        C
        [me@host tmp]$ sort -u /tmp/t
        A
        B
        C
        [me@host tmp]$ sort /tmp/t | uniq -u
        B
        C
      That gives you one copy of each distinct line, whereas the OP wanted only the lines that appear exactly once. It could be done with uniq -c plus grep and cut, but at that point you just want to do it in Perl. Gar. Should have double-checked that -u option. Good answer.

      Caution: Contents may have been coded under pressure.
        heh. score one more for the *nix cmdline utils :)

        Just for the sake of argument/exercise, even if there were only -c I would still do it on the command line (these are also handy if you want lines that show up exactly N times, since -u only helps when N==1):
        # using perl:
        uniq -c /tmp/d | perl -ne '($n,$s)=split(/\t/,$_,2); print $s if $n == 1'

        # using grep/cut (make sure that's a real tab after the 1 in the grep):
        uniq -c /tmp/d | egrep '^ *1 ' | cut -d\t -f2
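        A pure-Perl sketch of the same "exactly N times" idea, which avoids the pre-sort entirely; the choice of N==2 is only an assumption for illustration:

        # hypothetical example: print the lines that occur exactly twice,
        # keeping their original order (no sort or uniq needed)
        perl -ne 'push @lines, $_; $count{$_}++;
                  END { print grep { $count{$_} == 2 } @lines }' /tmp/d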
Re: removing non-duplicates
by Fang (Pilgrim) on Jul 11, 2005 at 19:27 UTC
      Is it best to read the file into an array and, for each element, ignore it if it matches another? Or a hash, with key and value equal to parts of each line?

        It really depends on what you want to do in the end. Do you want to create a new file with all the duplicate entries removed? Do you want to keep one instance of each unique entry? Or do you simply need a report about the entries?

        From what you told us, I'd say there's no need to read the entire file into memory; something like the following should do.

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %seen;
        my $file = "/path/to/your/file";

        open(MYFILE, "<", $file) or die "Could not open '$file' for reading: $!";
        while (<MYFILE>) {
            chomp;          # strip the newline so the key is just the entry itself
            $seen{$_}++;
        }
        close MYFILE;

        # Now every unique entry has a value of 1 in the hash %seen
        print "Unique entries:\n";
        print "$_\n" for (grep { $seen{$_} == 1 } keys %seen);
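        Given the OP's sample lines (at12, at13, bn23, bn23), this would print at12 and at13, in whatever order keys %seen happens to return them, since bn23 is the only entry that appears more than once.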
Re: removing non-duplicates
by jdporter (Paladin) on Jul 11, 2005 at 19:43 UTC
    my @lines = <>;
    my %count;
    $count{$_}++ for @lines;
    my @unique_lines = grep { $count{$_} == 1 } @lines;
    print for @unique_lines; # or whatever you do with the result
    This solution does not require the input to be pre-sorted, and it preserves the original order of the lines printed.
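    If the file is too large to hold every line in memory (the OP does say it is a large file), here is a minimal sketch of a two-pass variant of the same counting idea; the file name is an assumption:

    use strict;
    use warnings;

    # first pass: count occurrences; only one hash entry per distinct line stays in memory
    my $file = 'file.txt';    # hypothetical file name
    my %count;
    open my $fh, '<', $file or die "Cannot open '$file': $!";
    $count{$_}++ while <$fh>;
    close $fh;

    # second pass: print the lines that occurred exactly once, in their original order
    open $fh, '<', $file or die "Cannot open '$file': $!";
    while (<$fh>) {
        print if $count{$_} == 1;
    }
    close $fh;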
Re: removing non-duplicates
by tlm (Prior) on Jul 12, 2005 at 00:43 UTC

    Here's a command-line variant of other solutions offered:

    % perl -ne 'push @a, $_; $h{$_}++; END { print for grep $h{$_}==1, @a }' foo.txt
    If you don't care to preserve the original order of lines then:
    % perl -ne '$h{$_}++; END { print for grep $h{$_}==1, keys %h }' foo.txt

    the lowliest monk

Re: removing non-duplicates
by gopalr (Priest) on Jul 12, 2005 at 05:28 UTC

    TIMTOWTDI

    use strict;
    use List::MoreUtils qw(uniq);

    my @array = <DATA>;
    @array = uniq(@array);    # one copy of each distinct line
    print "\n@array";

    __DATA__
    at12
    at12
    at12
    at13
    bn23
    bn23
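    A side note: uniq keeps one copy of each distinct line, whereas the OP asked for the lines that occur exactly once. If your List::MoreUtils is recent enough to export singleton, a sketch along the same lines would be:

    use strict;
    use List::MoreUtils qw(singleton);   # assumes a version that provides singleton()

    my @array = <DATA>;
    # singleton() keeps only the values that occur exactly once
    print singleton(@array);

    __DATA__
    at12
    at12
    at12
    at13
    bn23
    bn23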
Re: removing non-duplicates
by anonymized user 468275 (Curate) on Jul 12, 2005 at 09:52 UTC
    Apart from simply adding the -u option to unix sort, here are two perl options:

    1) preserving the existing order:

    perl -e 'while(<>){ $_{$_} or print $_; $_{$_}=1;}' <file
    2) sorted
    perl -e 'while(<>){ $_{$_}=1;} print sort keys %_;' <file
    Note: I consider %_ safe enough for quick perl -e usage (without any modules), but not for normal programming, where the hash %_ should be replaced by a properly named and declared one, as in the sketch below.
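    For instance, a minimal sketch of the first (order-preserving) one-liner rewritten as a normal script, with a declared lexical hash in place of %_; the file name is an assumption:

    use strict;
    use warnings;

    my %seen;    # properly named and declared replacement for %_
    open my $fh, '<', 'file.txt' or die "Cannot open file.txt: $!";   # hypothetical file name
    while (<$fh>) {
        print unless $seen{$_}++;    # print only the first occurrence of each line
    }
    close $fh;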

    One world, one people