Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi.

I wrote a script to remove duplicate records from a text file. However, I don't think my way is the best way to do it, since it gets slow with large files. This is the code I wrote:
    @clean = ();
    @check = ();
    open(LIST, 'file.txt') or die "Error 1";
    while (<LIST>) {
        chomp $_;
        $count = 0;
        # compare the current line against every line seen so far
        foreach $i (@check) {
            chomp $i;
            if ($_ eq $i) {
                $count++;
            }
        }
        if ($count >= 1) {
            next;
        }
        else {
            push(@clean, $_);
            push(@check, $_);
        }
    }
    close(LIST);
Any ideas on how to make this faster?

Thanks,
Ralph.

Replies are listed 'Best First'.
Re: Removing duplicate records in text file
by valdez (Monsignor) on Aug 03, 2003 at 18:59 UTC
Re: Removing duplicate records in text file
by pzbagel (Chaplain) on Aug 03, 2003 at 18:56 UTC

    Are the duplicate records exactly the same? Do you care about the order of the records when you print them out again? You can use a hash to speed things up:

    my %uniq = ();
    open(LIST, 'file.txt') or die "Error 1";
    while (<LIST>) {
        chomp;
        $uniq{$_} = 1;
    }
    close(LIST);
    # the keys of %uniq now contain all the records with duplicates removed

    HTH
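
    A note on ordering: if the original order does matter, the same hash idea works as a one-liner from the shell, keeping only the first occurrence of each line. This is a sketch assuming the file names from the thread:

    # %seen counts how many times each line has appeared;
    # a line is printed only the first time (when its count is still zero).
    perl -ne 'print unless $seen{$_}++' file.txt > file2.txt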

Re: Removing duplicate records in text file
by blue_cowdawg (Monsignor) on Aug 03, 2003 at 19:29 UTC

    Try this:

        . . . Some hand waving here... . .
        my %een = ();
        open FIN, "< file.txt" or die $!;
        # keep only lines not seen before; %een counts occurrences
        my @lines = grep !$een{$_}++, <FIN>;
        close FIN;
        . . . whatever else... .

    Peter @ Berghold . Net

    Sieze the cow! Bite the day!

    Test the code? We don't need to test no stinkin' code! All code posted here is as is where is unless otherwise stated.

    Brewer of Belgian style Ales

Re: Removing duplicate records in text file
by BUU (Prior) on Aug 03, 2003 at 18:54 UTC
        cat file.txt | uniq > file2.txt

      OK, granted it's not Perl... but it is fast.

      You also need to sort the lines first, since uniq only removes adjacent duplicates:

      sort file.txt | uniq > file2.txt

      Also, being a Unix-y solution, you may be leaving Windows users out in the cold unless they have cygwin or something similar installed.

        Or, if your sort supports it:
        sort -u file.txt > file2.txt
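
        As a quick check that the two pipelines agree (sample data assumed, not from the original poster):

        # a small file with out-of-order duplicates
        printf 'banana\napple\nbanana\ncherry\napple\n' > file.txt

        # sort -u sorts and de-duplicates in one step...
        sort -u file.txt > file2.txt

        # ...which is equivalent to piping sort through uniq
        sort file.txt | uniq > file3.txt

        cat file2.txt
        # apple
        # banana
        # cherry

        Note that both shell approaches change the order of the records; if the original order matters, the hash-based Perl solutions above are the way to go.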