Re: removing non-duplicates
by davidrw (Prior) on Jul 11, 2005 at 19:36 UTC
Why didn't unix uniq work? I believe that uniq -u file.txt is what you're looking for. Note that the file needs to be sorted first:
sort file.txt | uniq -u > unique_lines.txt
Although sort file | uniq works, why not just use...
sort -u
Because that doesn't do what the OP wanted. Quotes from the respective man pages:
- sort: -u outputs only the first of an equal run (i.e. all distinct rows)
- uniq: -u only prints lines that are unique in the input (i.e. all rows that appear exactly once)
[me@host tmp]$ cat /tmp/t
A
B
A
C
[me@host tmp]$ sort -u /tmp/t
A
B
C
[me@host tmp]$ sort /tmp/t | uniq -u
B
C
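For comparison, roughly the same distinction spelled out in Perl (just a sketch; the %count hash, @order array and the use of <> are illustrative, not taken from either man page):
#!/usr/bin/perl
# Sketch: contrast "distinct lines" (what sort -u keeps) with
# "lines that appear exactly once" (what sort | uniq -u keeps).
use strict;
use warnings;
my %count;
my @order;                        # first-seen order of the distinct lines
while (my $line = <>) {
    push @order, $line unless $count{$line}++;
}
print "distinct (cf. sort -u):\n";
print for @order;
print "exactly once (cf. sort | uniq -u):\n";
print grep { $count{$_} == 1 } @order;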
heh. score one more for the *nix cmdline utils :)
Just for the sake of argument/exercise, even if there were only -c I would still do it on the cmdline (also, these are handy if you want lines that show up N times, since -u only helps if N==1):
# using perl:
uniq -c /tmp/d | perl -ne '($n,$s)=split(/\t/,$_,2); print $s if $n == 1'
# using grep/cut (make sure that's a real tab after the 1 in the grep; cut splits on tab by default)
uniq -c /tmp/d | egrep '^ *1 ' | cut -f2
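And if you'd rather stay in Perl for the general N case, a rough equivalent of the uniq -c pipelines above (the $n variable and the use of <> are made up for illustration):
#!/usr/bin/perl
# Sketch: print lines that appear exactly $n times, in first-seen order.
use strict;
use warnings;
my $n = 1;                        # set to 2, 3, ... for lines that show up N times
my %count;
my @lines = <>;
$count{$_}++ for @lines;
my %printed;
for my $line (@lines) {
    print $line if $count{$line} == $n && !$printed{$line}++;
}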
Re: removing non-duplicates
by Fang (Pilgrim) on Jul 11, 2005 at 19:27 UTC
There are many ways to go about this, the most natural one being with the help of a hash. perldoc -q duplicate offers very good starting points.
Update: pushed the submit button too soon once again and forgot to mention two nodes in the Categorized Questions and Answers.
Is it best to read the file into an array and, for each element, ignore it if it matches?
Or a hash, with key and value equal to parts of each line?
It really depends on what you want to do in the end. Do you want to create a new file with all the duplicate entries removed? Do you want to keep one instance of each unique entry? Or do you simply need a report about the entries?
From what you told us, I'd say there's no need to read the entire file into memory; something like the following should do.
#!/usr/bin/perl
use strict;
use warnings;
my %seen;
my $file = "/path/to/your/file";
open(MYFILE, "<", $file) or die "Could not open '$file' for reading: $!";
while (<MYFILE>) {
$seen{$_}++;
}
close MYFILE;
# Now every unique entry has a value of 1 in the hash %seen
print "Unique entries:\n";
print "$_\n" for (grep { $seen{$_} == 1 } keys %seen);
Re: removing non-duplicates
by jdporter (Paladin) on Jul 11, 2005 at 19:43 UTC
my @lines = <>;
my %count;
$count{$_}++ for @lines;
my @unique_lines = grep { $count{$_} == 1 } @lines;
print for @unique_lines; # or whatever you do with the result
This solution does not require the input to be pre-sorted, and it preserves the original order of the lines printed.
Re: removing non-duplicates
by tlm (Prior) on Jul 12, 2005 at 00:43 UTC
This preserves the original order of the lines:
% perl -ne 'push @a, $_; $h{$_}++; END { print for grep $h{$_}==1, @a }' foo.txt
If you don't care to preserve the original order of lines then:
% perl -ne '$h{$_}++; END { print for grep $h{$_}==1, keys %h }' foo.txt
Re: removing non-duplicates
by gopalr (Priest) on Jul 12, 2005 at 05:28 UTC
use List::MoreUtils qw(uniq);
use strict;
my @array=<DATA>;
@array=uniq(@array);
print "\n@array";
__DATA__
at12
at12
at12
at13
bn23
bn23
Re: removing non-duplicates
by anonymized user 468275 (Curate) on Jul 12, 2005 at 09:52 UTC
1) unsorted:
perl -e 'while(<>){ $_{$_} or print $_; $_{$_}=1;}' <file
2) sorted:
perl -e 'while(<>){ $_{$_}=1;} print sort keys %_;' <file
Note: I deem it safe enough to use %_ for quick perl -e usage (without any modules), but not for normal programming, where the hash %_ should be replaced by a properly named and declared one.
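For example, the unsorted one-liner with a declared hash instead of %_ would look something like this (a sketch; the behaviour should be the same):
perl -e 'my %seen; while(<>){ print unless $seen{$_}++ }' <file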