in reply to Find duplicate lines from the file and write it into new file.
The `seek` is really overkill in this case, but it would be needed if you used a simple checksum in place of the Digest::MD5 method.

```perl
#!/usr/bin/perl
use strict;
use Digest::MD5;

my $file = shift;
my ($input, $check);
open($input, $file) or die "Could not open $file: $!";
open($check, $file) or die "Could not open $file: $!";

my %hash;
while (!eof($input)) {
    my $location = tell($input);
    my $line     = readline($input);
    chomp $line;

    my $digest = Digest::MD5::md5($line);
    # $digest = length($line);   # with a cheap checksum like this, the seek check becomes essential

    if (defined(my $ll = $hash{$digest})) {
        # Possible duplicate: re-read the earlier candidate lines and compare,
        # since two different lines could (in principle) share a digest.
        my $d = 0;
        for my $l (@$ll) {
            seek($check, $l, 0);
            my $checkl = readline($check);
            chomp $checkl;
            if ($checkl eq $line) {
                print "DUP $line\n";
                $d = 1;
                last;
            }
        }
        if ($d == 0) {
            push(@{$hash{$digest}}, $location);
        }
    }
    else {
        push(@{$hash{$digest}}, $location);
    }
}
```
Note: This will only save memory if the average line length is longer than 16 bytes (the size of a binary MD5 digest).
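As a quick standalone check of that 16-byte figure (my own sketch, not part of the original script): `Digest::MD5::md5` returns a fixed-length binary digest regardless of how long the input line is, which is what sets the break-even point for memory savings.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex);

# The binary digest is always 16 bytes, no matter how long the line is:
for my $line ("short", "a much, much longer line of input text") {
    printf "%-40s -> %d bytes\n", $line, length(md5($line));
}

# By contrast, md5_hex() returns a 32-character hex string,
# which would double the per-key memory cost of the hash.
print length(md5_hex("short")), " characters as hex\n";
```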
UPDATE: Changed the code to correctly handle the problem pointed out by Corion.
Solution for wojtyk. This will let you make up to 256 passes over the file. It does assume that the first byte of the digest is roughly uniformly distributed.
```perl
#!/usr/bin/perl
use strict;
use Digest::MD5;

my $file = shift;
my ($input, $check);
open($input, $file) or die "Could not open $file: $!";
open($check, $file) or die "Could not open $file: $!";

my $passes = 2;
for (my $pass = 0; $pass < $passes; $pass++) {
    my %hash;   # start each pass empty so only ~1/$passes of the digests are held at once
    while (!eof($input)) {
        my $location = tell($input);
        my $line     = readline($input);
        chomp $line;

        my $digest = Digest::MD5::md5($line);
        my $p = ord($digest);          # first byte of the digest selects the pass
        if ($p % $passes != $pass) {
            next;
        }

        if (defined(my $ll = $hash{$digest})) {
            my $d = 0;
            for my $l (@$ll) {
                seek($check, $l, 0);
                my $checkl = readline($check);
                chomp $checkl;
                if ($checkl eq $line) {
                    print "DUP $line\n";
                    $d = 1;
                    last;
                }
            }
            if ($d == 0) {
                push(@{$hash{$digest}}, $location);
            }
        }
        else {
            push(@{$hash{$digest}}, $location);
        }
    }
    seek($input, 0, 0);   # rewind the input for the next pass
}
```
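To illustrate the partitioning trick in isolation (my own sketch, not part of the script above): each line is assigned to a pass by taking the first byte of its digest modulo the number of passes, so any single pass only ever stores the digests that fall in its bucket.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

my $passes = 4;
my %bucket_count;

# Assign some sample lines to passes the same way the script does:
for my $line ('a' .. 'z') {
    my $pass = ord(md5($line)) % $passes;   # first digest byte picks the pass
    $bucket_count{$pass}++;
}

# With a roughly uniform first byte, each pass sees about 26/4 of the lines.
for my $pass (sort { $a <=> $b } keys %bucket_count) {
    print "pass $pass: $bucket_count{$pass} lines\n";
}
```

Since `ord` of the first digest byte ranges over 0..255, `$passes` can usefully be anything up to 256, which is where that limit in the post comes from.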
Replies are listed 'Best First'.

- Re^2: Find duplicate lines from the file and write it into new file. by Corion (Patriarch) on Jan 04, 2007 at 15:58 UTC
- Re^2: Find duplicate lines from the file and write it into new file. by wojtyk (Friar) on Jan 04, 2007 at 19:15 UTC