If the lines are long, Digest::MD5 might be able to solve your problem: keep a 16-byte digest of each line in the hash instead of the line itself.

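#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

my $file = shift;
defined $file or die "Usage: $0 FILE\n";

# Two handles on the same file: $input scans forward, $check re-reads
# earlier lines whenever two digests match.
open(my $input, '<', $file) or die "Could not open $file: $!";
open(my $check, '<', $file) or die "Could not open $file: $!";

my %hash;    # binary MD5 digest => offsets of the lines with that digest

while (!eof($input)) {
    my $location = tell($input);
    my $line = readline($input);
    chomp $line;
    my $digest = Digest::MD5::md5($line);
#    $digest = length($line);    # weak stand-in key; exercises the seek path
    if (defined(my $ll = $hash{$digest})) {
        my $d = 0;
        for my $l (@$ll) {
            # Same digest: compare against the stored line itself.
            seek($check, $l, 0);
            my $checkl = readline($check);
            chomp $checkl;
            if ($checkl eq $line) {
                print "DUP $line\n";
                $d = 1;
                last;
            }
        }
        if ($d == 0) {
            # Digest collision with a different line: remember this one too.
            push(@{$hash{$digest}}, $location);
        }
    } else {
        push(@{$hash{$digest}}, $location);
    }
}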

The seek is really overkill in this case, since two different lines are extremely unlikely to share an MD5 digest, but it would be needed if you used a simple checksum in place of the Digest::MD5 method.
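
To show why the re-read matters with a weaker key, here is a minimal sketch that uses Perl's built-in unpack checksum instead of MD5. The helper name weak_checksum is made up for this example and is not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: a 32-bit sum of the bytes in the line.
# Lines made of the same bytes in a different order get the same sum.
sub weak_checksum {
    my ($line) = @_;
    my ($sum) = unpack("%32C*", $line);
    return $sum;
}

for my $pair (["ab", "ba"], ["listen", "silent"]) {
    my ($x, $y) = @$pair;
    printf "%s / %s: same checksum = %s, same line = %s\n",
        $x, $y,
        (weak_checksum($x) == weak_checksum($y) ? "yes" : "no"),
        ($x eq $y ? "yes" : "no");
}
```

With a key like this, matching keys only mean "possibly a duplicate", so comparing against the stored line is what keeps false DUP reports out.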

Note: This will only save memory if the average line length is longer than 16 bytes, which is the size of the binary MD5 digest used as the hash key.
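
The 16-byte figure comes from md5() returning the raw binary digest. A quick sketch to check it (the sample lines are arbitrary):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex);

my $short = "abc";
my $long  = "x" x 200;

# The binary digest has the same size no matter how long the line is;
# the hex form is twice as large, so it would save less.
for my $line ($short, $long) {
    printf "line: %3d bytes, md5: %d bytes, md5_hex: %d bytes\n",
        length($line), length(md5($line)), length(md5_hex($line));
}
```

Each key also carries an array of file offsets, so the real break-even point is probably somewhat above 16 bytes per line.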

UPDATE: Changed the code to correctly handle the problem pointed out by Corion.

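Solution for wojtyk: read the file in several passes, each one handling only the lines whose digest's first byte falls to that pass. This will let you have up to 256 passes. It does assume that the first byte of the digest is evenly distributed.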

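#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

my $file = shift;
defined $file or die "Usage: $0 FILE\n";

open(my $input, '<', $file) or die "Could not open $file: $!";
open(my $check, '<', $file) or die "Could not open $file: $!";

my $passes = 2;    # 1 to 256; more passes means a smaller hash per pass

for (my $pass = 0; $pass < $passes; $pass++) {
    # Start every pass with an empty hash; carrying entries over from
    # earlier passes would defeat the memory saving.
    my %hash;    # binary MD5 digest => offsets of the lines with that digest

    while (!eof($input)) {
        my $location = tell($input);
        my $line = readline($input);
        chomp $line;
        my $digest = Digest::MD5::md5($line);
        my $p = ord($digest);    # first byte of the digest, 0..255

        # Only handle the lines that belong to this pass.
        if ($p % $passes != $pass) {
            next;
        }

        if (defined(my $ll = $hash{$digest})) {
            my $d = 0;
            for my $l (@$ll) {
                # Same digest: compare against the stored line itself.
                seek($check, $l, 0);
                my $checkl = readline($check);
                chomp $checkl;
                if ($checkl eq $line) {
                    print "DUP $line\n";
                    $d = 1;
                    last;
                }
            }
            if ($d == 0) {
                push(@{$hash{$digest}}, $location);
            }
        } else {
            push(@{$hash{$digest}}, $location);
        }
    }
    seek($input, 0, 0);    # rewind for the next pass
}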
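
To see how lines get split between passes, here is a small sketch of just the selection step. The helper name pass_for and the generated sample lines are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: which pass (0 .. $passes-1) handles this line.
sub pass_for {
    my ($line, $passes) = @_;
    return ord(md5($line)) % $passes;    # ord() reads the first digest byte
}

# Count how many of 10,000 sample lines fall to each of 4 passes.
my $passes = 4;
my @count  = (0) x $passes;
$count[ pass_for("line $_", $passes) ]++ for 1 .. 10_000;
printf "pass %d: %d lines\n", $_, $count[$_] for 0 .. $passes - 1;
```

Identical lines always have the same digest, so they always fall to the same pass and no duplicate is missed. If $passes does not divide 256 evenly, some passes get a slightly larger share of the first-byte values.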

-- gam3
A picture is worth a thousand words, but takes 200K.

In reply to Re: Find duplicate lines from the file and write it into new file. by gam3
in thread Find duplicate lines from the file and write it into new file. by anna_here
