harry34 has asked for the wisdom of the Perl Monks concerning the following question:

If I have the following output, how would I code it to output a list of the duplicate data only (e.g. A(01) in this case)?
It needs to be coded in a general fashion, as the numbers and letters could change.

@array =

A(01)
B(02)
C(03)
A(01)
D(04)
E(05)

Thanks for your help Harry

Replies are listed 'Best First'.
Re: finding duplicate data
by gjb (Vicar) on Jan 21, 2004 at 10:49 UTC

    Sounds very much like homework, so I'll just give you a tip: a hash (e.g. %data) would come in handy. You can use the data (A(01), B(02), etc) as keys and the number of times you encounter the data as values.

    $data{$line}++;
    would do the trick.

    As a final step, you iterate over the keys (the function keys is useful here) in the hash and print those keys that have values larger than 1.
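    A minimal sketch of the approach gjb describes — counting occurrences in a hash, then printing the keys seen more than once — might look like the following (the array contents are taken from the question; variable names are illustrative):

        use strict;
        use warnings;

        my @array = ('A(01)', 'B(02)', 'C(03)', 'A(01)', 'D(04)', 'E(05)');

        # count how many times each item appears
        my %data;
        $data{$_}++ for @array;

        # keep only the items that appeared more than once
        my @dups = grep { $data{$_} > 1 } keys %data;
        print "$_\n" for @dups;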

    Hope this helps, -gjb-

Re: finding duplicate data
by borisz (Canon) on Jan 21, 2004 at 10:54 UTC
    Use a hash to count the number of occurrences of each string.
    #!/usr/bin/perl
    while (<DATA>) {
        chomp;
        $h{$_}++;
    }
    for ( sort grep { $h{$_} != 1 } keys %h ) {
        print "$_\n";
    }
    __DATA__
    A(01)
    B(02)
    C(03)
    A(01)
    D(04)
    E(05)
    Boris
      That works great!
      What is the second part of the code doing, i.e. how is it working?
        Which part has you puzzled? grep? for () implicitly setting $_? (If the former, you should be able to run "perldoc -f grep" to get a description of what grep does. If for some reason you have a broken perl that doesn't include perldoc, try here.)
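        To illustrate the two pieces mentioned above: grep filters a list, keeping only the elements for which the block is true (with each element aliased to $_), and a for loop with no named loop variable also sets $_ on each iteration. A small sketch, using made-up counts:

            use strict;
            use warnings;

            # hypothetical counts, as %h would hold after the while loop
            my %h = ( 'A(01)' => 2, 'B(02)' => 1, 'C(03)' => 1 );

            # grep keeps only the keys whose count is not 1
            my @dups = grep { $h{$_} != 1 } sort keys %h;

            # for without a loop variable aliases each element to $_
            for (@dups) {
                print "$_\n";
            }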
Re: finding duplicate data
by l3nz (Friar) on Jan 21, 2004 at 12:39 UTC
    This is a one-liner. As you can see, you can set the threshold at which items show up by tuning the constant. Hope this helps.

    map { print $_ if ( $h{$_}++ == 1 ) } <DATA>;
    __DATA__
    A(01)
    A(01)
    A(01)
    B(02)
    C(03)
    A(01)
    D(04)
    E(05)
    ...
    One-liners are definitely funny.
Re: finding duplicate data
by chimni (Pilgrim) on Jan 21, 2004 at 11:50 UTC

    You could also do it in a compact command-line manner.
    To find only the duplicate entries:
    perl -ne 'print if $h{$_}++' filename
    To find unique data:
    perl -ne 'print unless $h{$_}++' filename
    Of course, you could simply do cat filename | uniq at the shell prompt for the second case.
    HTH,
    chimni
      Useless use of cat, and a misunderstanding of uniq (it only looks for duplicate adjacent lines). Instead, use
      sort -u filename
      to find unique data (though the perl solution would be doing less work, and would not scramble the line order).

      To list the duplicate entries only once,

      perl -ne 'print if $h{$_}++ == 1' filename

      The PerlMonk tr/// Advocate