Okay, I'll take my armor and my shield, and no axes (today I'm friendly).

  1. Why do I use the MD5 digest as the hash key instead of the line itself? It would be easier to put the line into the hash and then check whether that line exists or not. But the hash would grow too large in memory (I didn't know the size of the victim file, but I expected it to be very long).
  2. False positives. I don't believe in them. An MD5 digest is 16 bytes long, so there are 2^(16*8) possible MD5 digests. For any one pair of different lines, the probability of a false positive is 1/2^(16*8) (I'm not a mathematician, but I think so). It's difficult to find two lines in a file with the same hash... Okay, if we want to be purists, I should add the number of characters in the line to the comparison, alongside the MD5 digest.
  3. "is quite reasonable to assume the md5sum will have a size comparable to that of the string itself": I didn't make this assumption, and maybe you are right on this point (Win didn't say anything about it)...
  4. About my code ... I didn't want to make it obscure ... I wanted it to be understandable ... :'(
  5. If one needs to print the non-duplicate lines ... okay, I wrote the code in a minute, trying to answer the question as quickly as I could ... I solved the problem, so that was enough for me at that moment... Okay, I admit that I didn't need to use the array; here I was wrong...
    #!/usr/bin/perl
    use strict;
    use Digest::MD5 qw(md5);
    my %line;
    while (<>) {
        my $digest = md5($_);
        unless ( exists $line{$digest} and $line{$digest} == length ) {
            $line{$digest} = length;
            print;
        }
    }
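
On point 2, a rough worked estimate may help. It rests on an assumption that is not from the thread: that MD5 digests of ordinary text lines behave like uniformly random 128-bit values (this does not hold against input deliberately crafted to collide). With n distinct lines there are n(n-1)/2 pairs, each colliding with probability 2^-128, so the chance of at least one false positive in the whole file is bounded by roughly:

```latex
P(\text{at least one collision}) \;\lesssim\; \binom{n}{2}\cdot 2^{-128}
  \;=\; \frac{n(n-1)}{2}\cdot 2^{-128}
  \;\approx\; \frac{n^{2}}{2^{129}}
% Example: n = 10^{9} lines gives 10^{18} / 2^{129} \approx 1.5 \times 10^{-21}
```

So even for a file with a billion distinct lines the estimate stays around 1.5e-21, which is why the extra length check matters only for purists.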

Last of all, it's nice to receive constructive criticism ...

turo

PS: thanks for the abbreviation (sort -u file); I didn't know it :-)

perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'

In reply to Re^3: Deleting duplicate lines from file by turo
in thread Deleting duplicate lines from file by Win
