in reply to Re: Deleting duplicate lines from file
in thread Deleting duplicate lines from file

While I often use (MD5) sums, I think they are overkill for checking duplicate lines, and, as usual, they expose you to the risk of false positives. Moreover, for reasonably sized lines, which are to be expected in this case, it is quite reasonable to assume that the MD5 sum will be about as large as the string itself or, depending on the actual data, even larger.

Also, the code seems just a little too verbose for my tastes, without that verbosity adding to readability. However, tastes are tastes, so I won't insist too much on this point.

Last, if one needs to print non-duplicate lines, it's pointlessly resource-consuming to gather them into an array only to print them all together at the end. Granted, this may be an illustration of a more general situation in which one actually needs to store all of them in one place. But the OP is clearly a newbie, and I fear that doing so here risks being cargo-culted into the bad habit of assigning to unnecessary variables all the time.
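
For the record, here's a minimal sketch of the streaming approach I have in mind, keyed on the lines themselves rather than on their digests (%seen is just a name I picked):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Print each line the first time it's seen; later duplicates are skipped.
    my %seen;
    while (<>) {
        print unless $seen{$_}++;
    }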

Oh, and the very last thing about your suggestion:

cat file | sort | uniq

The following is equivalent:

sort -u file
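
Besides doing the same job in a single process, sort -u spares the useless cat. Note that both reorder the input, though; if the original line order matters, the hash-based sketch above keeps the first occurrence of each line in place.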

Re^3: Deleting duplicate lines from file
by turo (Friar) on Feb 17, 2006 at 18:35 UTC

    Okay, I'll take my armor and my shield, and no axes (today I'm friendly).

    1. Why did I use MD5 digests instead of the lines themselves? It would be easier to put the line itself into the hash and then check whether it exists or not, but then the hash would grow too much in memory (I didn't know the size of the victim file, but I expected it to be very long).
    2. False positives? I don't believe in them. An MD5 digest is 16 bytes long, so there are 2^(16*8) possible digests. The probability of a false positive is 1/2^(16*8) (I'm no mathematician, but I think that's so). It's difficult to find two lines in a file with the same hash... Okay, if we want to be purists, I could add the line's number of characters to the comparison along with the MD5 digest.
    3. "is quite reasonable to assume the md5sum will have a size comparable to that of the string itself" ... I didn't make that assumption, and maybe you are right on this point (Win didn't say anything about this in particular)...
    4. About my code ... I didn't want to make it obscure ... I wanted it to be understandable ... :'(
    5. If one needs to print non-duplicate lines ... okay, I wrote the code in a minute, trying to answer the question as quickly as I could ... it solved the problem, so that was enough for me at the moment... Okay, I admit I didn't need to use the array; I was wrong there:
      #!/usr/bin/perl
      use strict;
      use Digest::MD5 qw(md5);

      my %line;
      while (<>) {
          my $digest = md5($_);
          # Skip the line only when both its digest and its length
          # have been seen before; otherwise record them and print it.
          unless ( exists $line{$digest} and $line{$digest} == length ) {
              $line{$digest} = length;
              print;
          }
      }
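
    (With the length added to the comparison, a line is only skipped when both the digest and the length match, so a false positive would now need an MD5 collision between two lines of the same length.)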

    Last of all, it's nice to receive constructive criticism ...

    turo

    PS: thanks for the abbreviation (sort -u file), I didn't know it :-)

    perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'