Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have data that goes into a text file from a CGI form. The problem is I don't want duplicates in the second and fifth fields. The second field is a date and the fifth field is a number that should always be unique. I will put this script in my CGI form so that after the form is submitted it will open this text file and check for duplicates in the second and fifth fields only.
121|20030831|lkj|lkjlkj|65
122|20030801|qq|www|43
123|20030812|qq|aah|43
124|20030812|uiy|kjh|87
125|20030812|iuy|kjh|87 #duplicate here

Replies are listed 'Best First'.
Re: Finding duplicates
by halley (Prior) on Aug 12, 2003 at 18:18 UTC
    What have you tried already? What data structures in Perl do you think may be useful here? (Hint: when you hear 'duplicates' or 'unique', you should think hash.)

    I would say more, but this sounds like homework. Show us you've done some thinking first. See How to ask questions the smart way.

    --
    [ e d @ h a l l e y . c c ]

      Here is what I attempted and it is not working:
      $db = "textfile.txt";
      open(DATA, "$db") or die "cant open: $!\n";
      @dat = (<DATA>);
      close(DATA);
      open(DATA, "$db") || die "cant open: $!\n";
      foreach $line (@dat)
      {
          if($line =~ /87/g)  # I tried this just to see if I could fetch any data in my text file
          {
              print "test\n";
          }
      }
      close(DATA);
        Okay, you have combined two separate methods of reading the lines in the file. Pick one. They are functionally identical, but I recommend the latter because it doesn't require the WHOLE file to be in memory at any given time.
        ...
        $db = "textfile.txt";
        open(DATA, $db) or die "cant open: $!\n";
        @dat = <DATA>;
        close(DATA);
        foreach $line (@dat)
        {
            ...
        }
        ...
        $db = "textfile.txt";
        open(DATA, $db) || die "cant open: $!\n";
        while ($line = <DATA>)   # while, not foreach: foreach (<DATA>) would still slurp the whole file
        {
            ...
        }
        close(DATA);
        The instances of ... mark the areas where you're hoping for some help. You only care about fields 2 and 5 of each line. You either want to print any line that has already been seen, or you want to print any line that has not already been seen.

        Break down the problem further.

        • You need to keep track of what's been seen in some kind of data structure. (I hinted a hash.)
        • You need to test each line in the file against the data structure to see if it's been seen before, or not.
        • You need to decide whether to print the line or not.
        • You need to add the crucial fields to the data structure so your future iterations have something to check.

        Again, I'm treating this like it's homework, and drawing you through the thinking process, rather than just handing you a solution. If you just want to be given code, I'm sure some other folks are happy to grant your wish.

        --
        [ e d @ h a l l e y . c c ]
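Putting those four steps together, a minimal hash-based sketch (the variable names are my own, and the records are taken from the sample data in the original post):

```perl
use strict;
use warnings;

# Sample records from the original post.
my @lines = (
    "121|20030831|lkj|lkjlkj|65",
    "122|20030801|qq|www|43",
    "123|20030812|qq|aah|43",
    "124|20030812|uiy|kjh|87",
    "125|20030812|iuy|kjh|87",   # same date and number as the line above
);

my %seen;    # the data structure: keys are "field2|field5" pairs
my @dups;    # lines whose key has been seen before

for my $line (@lines) {
    # Fields 2 and 5 are indices 1 and 4 after split.
    my $key = join('|', (split /\|/, $line)[1, 4]);
    push @dups, $line if $seen{$key}++;
}

print "duplicate: $_\n" for @dups;
```

The `$seen{$key}++` test-and-increment does steps two and four in one expression: it is false the first time a key appears and true every time after.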

Re: Finding duplicates
by flounder99 (Friar) on Aug 12, 2003 at 19:56 UTC
    Maybe this code will start you in the right direction:
    use strict;

    my $newdata = '124|20030812|uiy|kjh|87';
    my $db = "textfile.txt";
    open (DATAFILE, "+<$db") or die $!;
    my $key = join("|", (split /\|/, $newdata)[1,4]);
    while (<DATAFILE>) {
        chomp;
        if (join("|", (split /\|/)[1,4]) eq $key) {
            close DATAFILE;
            exit;
        }
    }
    print DATAFILE $newdata,"\n";
    close DATAFILE;
    $newdata will be added to the text file if it is not already there. I would probably put this in a sub and do a return instead of an exit. This consumes no extra memory no matter how big textfile.txt gets.
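    That refactor might look like this (the sub name is my own; the seek before writing is an addition, since perlfunc recommends a seek when switching between reading and writing on the same handle):

```perl
use strict;
use warnings;

# flounder99's approach as a sub: returns 0 if the record's
# date/number pair already exists, 1 after appending it.
sub add_unless_dup {
    my ($db, $newdata) = @_;
    my $key = join('|', (split /\|/, $newdata)[1, 4]);
    open(my $fh, '+<', $db) or die $!;
    while (<$fh>) {
        chomp;
        if (join('|', (split /\|/)[1, 4]) eq $key) {
            close $fh;
            return 0;            # duplicate found, nothing written
        }
    }
    seek $fh, 0, 2;              # position at end of file before writing
    print $fh $newdata, "\n";
    close $fh;
    return 1;
}
```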

    --

    flounder

      thanks
Re: Finding duplicates
by johndageek (Hermit) on Aug 12, 2003 at 18:24 UTC
    So tell me, what does your current code, or code attempt, look like?
    Please post it.

    If you have no code, try looking up the following: split(), arrays, sorting. And if you feel like fun, try hashes.
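    As a starting point, here is split() on one record from the original post (note the pipe must be escaped, since an unescaped | means alternation in a regex):

```perl
use strict;
use warnings;

# One record from the sample data; fields are pipe-delimited.
my $record = "124|20030812|uiy|kjh|87";
my @fields = split /\|/, $record;
print "date: $fields[1], number: $fields[4]\n";   # prints "date: 20030812, number: 87"
```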

    Enjoy!
    John