output unique lines only

sbp has asked for the wisdom of the Perl Monks concerning the following question:

Re: output unique lines only by swkronenfeld (Hermit) on Dec 06, 2005 at 16:42 UTC
No need for Perl, unless you're doing something more complicated. Type this from your *IX command line. `cut -d" " -f1 FileName | sort | uniq`
Re^2: output unique lines only by Perl Mouse (Chaplain) on Dec 06, 2005 at 16:53 UTC
I'd go for a shell pipe as well, and it would be close to your suggestion. Except that I wouldn't use the final pipe, but use `sort -u` instead. But that's just a minor difference. I won't be handing out 'useless use of uniq' awards. `Perl --((8:>*`
Re^2: output unique lines only by merlyn (Sage) on Dec 06, 2005 at 18:13 UTC
Nearly every use of "sort | uniq" can be replaced with "sort -u". -- Randal L. Schwartz, Perl hacker Be sure to read my standard disclaimer if this is a reply.
Re^3: output unique lines only by l3v3l (Monk) on Dec 06, 2005 at 19:05 UTC
The only reason I can think of to use sort and uniq in combination instead of "sort -u" is to skip specific columns when looking for unique instances. Example: `... RH_MEa0001bG06_5 710 14 16 Invalid starting position (14) RH_MEa0001bG06_4 710 125 12 GGGGGACACCTTCTCTCTCT... RH_MEa0001bG06_6 710 125 12 GGGGGACACCTTCTCTCTCT... ...` Sending a file containing this output to "| sort -k2 | uniq -f1" compares each line with the one before it, ignoring the column you want to skip (column 1 in this case), keeps the first line of each run of matches, and gives you: `... RH_MEa0001bG06_4 710 125 12 GGGGGACACCTTCTCTCTCT... RH_MEa0001bG06_5 710 14 16 Invalid starting position (14) ...` Note that uniq only compares adjacent lines, so the sort has to be keyed on the columns being compared (hence `-k2`). After a plain `sort`, the `_5` line would sit between the two matching lines and all three would be printed.
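For comparison, here is a minimal Perl sketch of the same idea that does not depend on sort order. It keys a hash on everything after the first whitespace-separated column, so a line is kept only the first time its remaining columns are seen. The sub name `unique_ignoring_first_field` is made up for illustration and is not from the post above; the sample rows reuse the example data. Unlike the shell pipeline, the output stays in input order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep the first line seen for each distinct
# "rest of line" (everything after the first whitespace-separated field).
sub unique_ignoring_first_field {
    my @lines = @_;
    my %seen;
    my @kept;
    for my $line (@lines) {
        # Split into the first field and the remainder; the limit of 2
        # keeps the remainder in one piece.
        my (undef, $rest) = split ' ', $line, 2;
        $rest = '' unless defined $rest;    # line with a single field
        push @kept, $line unless $seen{$rest}++;
    }
    return @kept;
}

# Sample rows shaped like the example above: the first column differs,
# the remaining columns repeat for two of the rows.
my @rows = (
    'RH_MEa0001bG06_5 710 14 16 Invalid starting position (14)',
    'RH_MEa0001bG06_4 710 125 12 GGGGGACACCTTCTCTCTCT...',
    'RH_MEa0001bG06_6 710 125 12 GGGGGACACCTTCTCTCTCT...',
);
print "$_\n" for unique_ignoring_first_field(@rows);
```

Because the hash remembers every remainder it has seen, duplicates are caught even when they are not next to each other, at the cost of holding one key per distinct line in memory.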
Re: output unique lines only by tirwhan (Abbot) on Dec 06, 2005 at 16:33 UTC
You should try to make a little bit of effort to arrive at a solution on your own, or at least say "This is what I've tried but it doesn't work and I don't know why". Your task can be solved by reading the file in a loop, using `split` on each line and then putting the first returned element into a hash as a key (for example `$hash{$element}=1`). After you have read the whole file you can open another file for writing and do `for my $name (keys %hash) { print $filehandle "$name\n"; }` Try to solve it with that information and do come back and ask if you have problems.

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
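The approach described above can be sketched roughly as follows. This is a minimal sketch rather than tirwhan's code: it assumes tab-separated input with the name in the first column, and the sub name `unique_first_fields` and the sample data are invented for illustration. The keys are sorted only to make the output order predictable, since `keys` returns them in no particular order.

```perl
use strict;
use warnings;

# Hypothetical helper: read lines from a filehandle and return the
# distinct values of the first tab-separated field.
sub unique_first_fields {
    my ($in) = @_;
    my %hash;
    while (my $line = <$in>) {
        chomp $line;
        next if $line eq '';              # skip blank lines
        my ($element) = split /\t/, $line;
        $hash{$element} = 1;              # a duplicate just overwrites the same key
    }
    return sort keys %hash;               # sorted for a stable order
}

# Demo on an in-memory file; the sample data is an assumption.
my $data = "filename1\t10\nfilename2\t20\nfilename1\t30\nfilename4\t40\n";
open my $demo_in, '<', \$data or die "Cannot open in-memory file: $!";
print "$_\n" for unique_first_fields($demo_in);
close $demo_in;
```

To write the result to a file instead, open a second handle for writing and print to it in the final loop, as described in the post.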
Re: output unique lines only by davorg (Chancellor) on Dec 06, 2005 at 16:36 UTC
What parts are you having trouble with?

- Use "open" to open the file
- Use "< ... >" to read from the file
- Use "split" to break each line into its parts
- Use a hash to store the filename
- Only print filenames if they don't exist in the hash

Update: I deliberately didn't give any code as I don't like to help people who show no sign of putting any effort in for themselves. It seems that others don't agree with that policy.

-- <http://dave.org.uk> "The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg
Re: output unique lines only by chibiryuu (Beadle) on Dec 06, 2005 at 16:36 UTC
`my %seen; while (<>) { s/\t.*//s; $seen{$_}++ or print "$_\n"; }`
Re: output unique lines only by blazar (Canon) on Dec 06, 2005 at 16:39 UTC
IIUC `$ perl -lne 's/\t.*//; print if !$saw{$_}++' input_file > output_file`
Re: output unique lines only by EdwardG (Vicar) on Dec 06, 2005 at 16:44 UTC
Here's one approach:

- Use STDIN and STDOUT for input and output.
- Use a regex to extract the first 'column'. You could also use split, but since you care only about the first column it may be overkill.
- Use a hash to gather unique filenames.

Put it all together and you will have something like this: `# uniqfiles.pl use strict; # helps prevent silly mistakes use warnings; # helpful when writing code my %uniq_fnames; # declared up front, as strict requires while (<>) { # Reads from STDIN if (/^(\w+)\t/) { # If the line starts with one or more 'word' characters followed by a tab... my $filename = $1; # ...assume we've got a filename captured $uniq_fnames{$filename} = 1; # ...and add it to our hash. } } print $_,"\n" for keys %uniq_fnames; # prints to STDOUT, can be piped to a file`

Then you could use this as follows: `perl uniqfiles.pl < my_non_unique_list_of_files > my_unique_list_of_files`
Re: output unique lines only by cormanaz (Deacon) on Dec 06, 2005 at 19:05 UTC
This is easy to do with a hash. Open the file, read in one line at a time and use the split function to put the first element in each line (i.e. the filename) into a variable like $fn. If your hash is called %uniquefiles you then set the value for $fn to some arbitrary value, like `$uniquefiles{$fn} = 1;` If your loop comes across the same filename again, it will simply set the same value for the same filename, in effect eliminating the dupes. When you're all done %uniquefiles will only contain the unique filenames, which you can print like so: `foreach my $k (keys %uniquefiles) { print OUT "$k\n"; }` If you're just learning Perl, make sure you learn about hashes. They're a very powerful feature. Steve
Re^2: output unique lines only by sbp (Initiate) on Dec 07, 2005 at 03:21 UTC
Thanks everyone for the tips and suggestions. I've decided to approach this using a hashtable. I came up with the following script but it doesn't seem to be working correctly. `#!/usr/bin/perl -w $filelist = "/home/exp/acctlist.txt"; open(FILEDUPS, $filelist) || die ("Cannot open $filelist"); open($output, '>', '/home/exp/output.txt') || die ("Cannot open file"); while ($line = <FILEDUPS>) { chomp $line; ($filename, undef, undef, undef, undef) = split /\t/, $line; } $uniquefiles{$filename} = 1; foreach $k (keys %uniquefiles) { print $output "$k\n"; }` It currently only outputs one line. For example, if my file contains filename1 filename2 filename1 filename4, then it outputs the first line only: filename1, whereas it should output: filename1 filename2 filename4. I've spent a long time trying to debug this, but I'm not sure where I'm going wrong. Thanks.
Re^3: output unique lines only by kulls (Hermit) on Dec 07, 2005 at 04:11 UTC
Hi, I guess you should put the `$uniquefiles{$filename} = 1;` statement inside the while loop, so that it runs for every line instead of once after the loop has finished. -kulls
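With that change applied, the script might look like the sketch below. It goes a little beyond the one-line fix, and those extras are assumptions rather than anything posted in the thread: it adds `use strict`, uses lexical filehandles with three-argument `open`, reports `$!` on failure, and prints names in first-seen order (which matches the expected output in the question) instead of iterating over `keys`. The sub name `write_unique_filenames` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the loop from the question: copy the
# first tab-separated field of each input line to the output file,
# once per distinct value.
sub write_unique_filenames {
    my ($in_path, $out_path) = @_;
    open my $in,  '<', $in_path  or die "Cannot open $in_path: $!";
    open my $out, '>', $out_path or die "Cannot open $out_path: $!";

    my %uniquefiles;
    while (my $line = <$in>) {
        chomp $line;
        my ($filename) = split /\t/, $line;
        next unless defined $filename && length $filename;
        # The hash update sits inside the loop, so every line is
        # recorded, not only the last one read.
        print {$out} "$filename\n" unless $uniquefiles{$filename}++;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return scalar keys %uniquefiles;    # number of distinct names written
}

# Runs only when exactly two paths are given on the command line.
if (@ARGV == 2) {
    my $count = write_unique_filenames(@ARGV);
    print "Wrote $count unique filenames\n";
}
```

Saved under a name of your choosing (say `uniqnames.pl`, a made-up name), it could be run as `perl uniqnames.pl /home/exp/acctlist.txt /home/exp/output.txt`, using the paths from the question.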