Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi.

I wrote a script to remove duplicate records from a text file. However, I don't think my way is the best way to do it, since it gets slow with large files. This is the code I wrote:
    @clean = ();
    @check = ();
    open(LIST, 'file.txt') or die "Error 1";
    while (<LIST>) {
        chomp $_;
        $count = 0;
        # compare the current line against every line seen so far
        foreach $i (@check) {
            chomp $i;
            if ($_ eq $i) {
                $count++;
            }
        }
        if ($count >= 1) {
            next;
        }
        else {
            push(@clean, $_);
            push(@check, $_);
        }
    }
    close(LIST);
Any ideas on how to make this faster?

Thanks,
Ralph.

Replies are listed 'Best First'.
Re: Removing duplicate records in text file
by valdez (Monsignor) on Aug 03, 2003 at 18:59 UTC
Re: Removing duplicate records in text file
by pzbagel (Chaplain) on Aug 03, 2003 at 18:56 UTC

    Are the duplicate records exactly the same? Do you care about the order of the records when you print them out again? You can use a hash to speed things up:

    my %uniq = ();
    open(LIST, 'file.txt') or die "Error 1";
    while (<LIST>) {
        chomp;
        $uniq{$_} = 1;
    }
    close(LIST);
    # the keys of %uniq now contain all the records with duplicates removed

    HTH
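
    A note on ordering: if the original order does matter, the same hash idea works as a one-liner from the shell, keeping only the first occurrence of each line. This is a sketch assuming the file names from the thread:

    # %seen counts how many times each line has appeared;
    # a line is printed only the first time (when its count is still zero).
    perl -ne 'print unless $seen{$_}++' file.txt > file2.txt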

Re: Removing duplicate records in text file
by blue_cowdawg (Monsignor) on Aug 03, 2003 at 19:29 UTC

    Try this:

        . . . Some hand waving here... . .
        my %een = ();
        open FIN, "< file.txt" or die $!;
        # keep only lines not seen before; %een counts occurrences
        my @lines = grep !$een{$_}++, <FIN>;
        close FIN;
        . . . whatever else... .

    Peter @ Berghold . Net

    Sieze the cow! Bite the day!

    Test the code? We don't need to test no stinkin' code! All code posted here is as is where is unless otherwise stated.

    Brewer of Belgian style Ales

Re: Removing duplicate records in text file
by BUU (Prior) on Aug 03, 2003 at 18:54 UTC
        cat file.txt | uniq > file2.txt

      OK, granted it's not Perl... but it is fast.

      You also need to sort the lines first, since uniq only removes adjacent duplicates:

      sort file.txt | uniq > file2.txt

      Also, being a Unix-y solution, you may be leaving Windows users out in the cold unless they have cygwin or something similar installed.

        Or, if your sort supports it:
        sort -u file.txt > file2.txt
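
        As a quick check that the two pipelines agree (sample data assumed, not from the original poster):

        # a small file with out-of-order duplicates
        printf 'banana\napple\nbanana\ncherry\napple\n' > file.txt

        # sort -u sorts and de-duplicates in one step...
        sort -u file.txt > file2.txt

        # ...which is equivalent to piping sort through uniq
        sort file.txt | uniq > file3.txt

        cat file2.txt
        # apple
        # banana
        # cherry

        Note that both shell approaches change the order of the records; if the original order matters, the hash-based Perl solutions above are the way to go.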