PerlMonks  

Re^3: Deleting duplicate lines from file

by turo (Friar)
on Feb 17, 2006 at 18:35 UTC


in reply to Re^2: Deleting duplicate lines from file
in thread Deleting duplicate lines from file

Okay, I'll take my armor and my shield, and no axes (today I'm friendly).

  1. Why do I use the MD5 digest as the hash key instead of the line itself? It would be easier to put the line into the hash and then check whether that line exists or not. But the hash would grow too large in memory (I didn't know the size of the victim file, but I expected it to be very long).
  2. False positives: I don't believe that. An MD5 digest is 16 bytes long, so there are 2^(16*8) possible MD5 digests. The probability that two given different lines share a digest is 1/2^(16*8) (I'm no mathematician, but I think so). It is difficult to find two lines in a file with the same hash... Okay, if we want to be purists, I should compare the number of characters of the line as well as the MD5 digest.
  3. "is quite reasonable to assume the md5sum will have a size comparable to that of the string itself": I didn't make this assumption, and maybe you are right on this point (Win didn't say anything about this in particular)...
  4. About my code... I didn't want to make it obscure... I wanted it to be understandable... :'(
  5. If one needs to print the non-duplicate lines... okay, I wrote the code in a minute, trying to answer the question as quickly as I could... It solved the problem, so that was enough for me at that moment... Okay, I admit that I didn't need to use the array; here I was wrong...
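    #!/usr/bin/perl
    use strict;
    use Digest::MD5 qw(md5);

    my %line;    # MD5 digest => length of the line that produced it
    while (<>) {
        my $digest = md5($_);
        # A line counts as a duplicate only if both digest and length match
        unless ( exists $line{$digest} and $line{$digest} == length ) {
            $line{$digest} = length;
            print;
        }
    }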
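
A rough check of the estimate in point 2, as a sketch only. It assumes MD5 behaves like a uniform random function over 128 bits and that nobody crafted the lines to collide on purpose (deliberate MD5 collisions can be constructed). The 1/2^(16*8) figure applies to one given pair of lines; a file with n distinct lines has n(n-1)/2 pairs, which gives the usual birthday-bound approximation. The value n = 10^9 is an illustrative assumption, not a figure from the thread.

    P(\text{collision}) \approx 1 - e^{-n(n-1)/2^{129}} \approx \frac{n(n-1)}{2^{129}}

    n = 10^{9}: \quad P \approx \frac{10^{18}}{6.8 \times 10^{38}} \approx 1.5 \times 10^{-21}

So even for a billion distinct lines, an accidental false positive stays extremely unlikely, and the extra length check only lowers it further.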

Last of all, it's nice to receive constructive criticism...

turo

PS: thanks for the shortcut (sort -u file), I didn't know it :-)

perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'
