bedohave9 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, we have a .NET application to be tested in Perl. We are working with .NET 4.0 on the Windows 7 OS. My requirement is to identify the duplicate records in a text file. We have nearly 1000 of these files to process, and the time frame is very tight. A sample line of data from one of the files:

ABC Mutual Funds |00000111|001110407033|011000073|FRF|CG12088886S11086: ABC Mutual Funds|665.26
Here is my code:

#!/usr/bin/perl
use strict;
use warnings;
use File::IO;

my %no_dupes;
while (<>) {
    chomp;
    $no_dupes{$_} = $. if (! exists $no_dupes{$_});
}

$result_file = "C:\Users\spullabhotla\Desktop\TestToRemoveDuplicates.txt";
open (RESULT_FILE, ">$result_file");

foreach my $line (sort {$no_dupes{$a} <=> $no_dupes{$b}} keys %no_dupes) {
    print RESULT_FILE "$line\n";
}
When I executed the script, the console threw this error:

Can't locate File/IO.pm in @INC (@INC contains: C:/Perl64/site/lib C:/Perl64/lib .) at C:\Perl64\Perl Programs\Learning\remit_dup.pl line 5.
BEGIN failed--compilation aborted at C:\Perl64\Perl Programs\Learning\remit_dup.pl line 5.

Please let me know if I need to change anything in the code.

Re: Remove Duplicates from one text file and print them in another text file.
by toolic (Bishop) on Jun 04, 2012 at 18:05 UTC
    Can't locate File/IO.pm in @INC
    That error means that you told perl to load the File::IO module, but it could not be found in the search path. Since your code doesn't seem to rely on that module anyway, try deleting the following line:
    use File::IO;

    Please edit your post to use code tags for your code and error messages: Writeup Formatting Tips

    To clean up other warnings and errors, change:

    $result_file = "C:\Users\spullabhotla\Desktop\TestToRemoveDuplicates.txt";

    to this (adding my, and using single quotes so the backslashes are not treated as escape sequences):

    my $result_file = 'C:\Users\spullabhotla\Desktop\TestToRemoveDuplicates.txt';
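
    Putting those fixes together, a minimal sketch of the corrected script might look like this (the three-arg open with error checking goes beyond what you posted, but it will tell you right away if the output file cannot be created):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Remember the line number of the first occurrence of each line.
    my %no_dupes;
    while (<>) {
        chomp;
        $no_dupes{$_} = $. if !exists $no_dupes{$_};
    }

    # Single quotes keep the backslashes literal.
    my $result_file = 'C:\Users\spullabhotla\Desktop\TestToRemoveDuplicates.txt';
    open my $result_fh, '>', $result_file
        or die "Cannot open $result_file: $!";

    # Print the deduplicated lines in their original order.
    for my $line ( sort { $no_dupes{$a} <=> $no_dupes{$b} } keys %no_dupes ) {
        print $result_fh "$line\n";
    }
    close $result_fh;

    Run it as perl remit_dup.pl input.txt so the diamond operator reads from the named file rather than waiting on STDIN.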

    BTW, there is a Core module named IO::File, if that's what you are trying to use.
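
    And if IO::File is what you were after, the open would look like this instead (just a sketch, reusing the $result_file variable from above):

    use IO::File;

    # Object-oriented filehandle; same effect as a plain open().
    my $fh = IO::File->new($result_file, '>')
        or die "Cannot open $result_file: $!";
    $fh->print("some line\n");
    $fh->close;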

      Thank you, that did work for me. The code executed without errors, but the console is just showing a blinking cursor on the screen. Does that mean there are no duplicates in the file? Please let me know; I am confused by the console behavior. Also, I was not able to find IO::File in the Perl Package Manager. Could I copy the IO::File code into a file, save it in the library with a .pm extension, and have that work?

        Also, I was not able to find IO::File in the Perl Package Manager.
        IO::File is a Core module, which means that it should be available to you. Try this at your command prompt:
        perldoc IO::File
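
        Another quick check from the same prompt, which prints the installed version if the module loads:

        perl -MIO::File -e "print IO::File->VERSION"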
Re: Remove Duplicates from one text file and print them in another text file.
by aaron_baugher (Curate) on Jun 04, 2012 at 18:48 UTC

    Since your time frame is very tight, and presumably this isn't homework that has to be done in Perl, you might want to avoid reinventing the wheel (these utilities are available on any *nix system, including under Cygwin):

    sort file.txt | uniq -d > duplicates.txt
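
    If you would rather not install anything, a roughly equivalent Perl one-liner works from the Windows command prompt as well (file.txt and duplicates.txt are placeholders; unlike uniq -d it does not need sorted input, and it prints each duplicated line once, at its second occurrence):

    perl -ne "print if ++$seen{$_} == 2" file.txt > duplicates.txt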

    Aaron B.
    Available for small or large Perl jobs; see my home node.

      I am working on .NET 4.0 and Windows 7 is my OS. The code threw an error. Probably this is for Unix/Linux OS. Please correct me if I am wrong.

      Backslash found where operator expected at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "Users\"
      Backslash found where operator expected at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "spullabhotla\"
      Backslash found where operator expected at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "Desktop\"
      syntax error at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "sort C:"
      Execution of C:\Perl64\Perl Programs\Learning\remdup1.pl aborted due to compilation errors.

        I am working on .NET 4.0 and Windows 7 is my OS. The code threw an error. Probably this is for Unix/Linux OS. Please correct me if I am wrong.

        Backslash found where operator expected at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "Users\"
        Backslash found where operator expected at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "spullabhotla\"
        Backslash found where operator expected at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "Desktop\"
        syntax error at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "sort C:"
        Execution of C:\Perl64\Perl Programs\Learning\remdup1.pl aborted due to compilation errors.

        Yes, utilities like sort and uniq are available in Unix/Linux operating systems. On Windows, you can install a package called Cygwin, which will give you a Unix-like environment and command line, plus those common utilities.

        Aaron B.
        Available for small or large Perl jobs; see my home node.

Re: Remove Duplicates from one text file and print them in another text file.
by davido (Cardinal) on Jun 04, 2012 at 18:35 UTC

    My requirement is to identify the duplicate records in a text file.

    while (<>) {
        chomp;
        $no_dupes{$_} = $. if (! exists $no_dupes{$_});
    }

    ...and what you're doing is identifying the location of a single duplicate for each line's content. Is it reasonable to ignore the possibility of more than one duplicate of a given line? (It may be fine; I don't know your data set. Or it may be a bug waiting to be discovered.) The code you posted is fine if you only care to know the content of any line that is duplicated somewhere in the file. But it falls short if you care how many times, and where, those duplicates are found.

    To allow for the possibility of flagging more than one duplicate, you might try this instead:

    use strict;
    use warnings;

    my %folded_lines;
    while ( <> ) {
        chomp;
        # Record every line number at which this content appears.
        push @{ $folded_lines{$_} }, $.;
    }

    # Keep only content seen more than once, ordered by first occurrence.
    my @dupes = sort { $folded_lines{$a}[0] <=> $folded_lines{$b}[0] }
                grep { @{ $folded_lines{$_} } > 1 }
                keys %folded_lines;

    # RESULT_FILE must already be opened, as in your original script.
    print RESULT_FILE $_, ': ', join( ', ', @{ $folded_lines{$_} } ), "\n"
        for @dupes;

    Every unique line of the file becomes a hash key. Lines that are repeated are detected, and every line number at which each one occurs is listed. If you want to omit the first occurrence from each list, that is simple as well. It would look like this:

    foreach ( @dupes ) {
        # Slice from the second element onward, skipping the first occurrence.
        print RESULT_FILE $_, ': ',
            join( ', ', @{ $folded_lines{$_} }[ 1 .. $#{ $folded_lines{$_} } ] ),
            "\n";
    }

    Sample output might look like this:

    don't panic: 34, 55, 89, 144

    Dave

      The code executed without errors, but the console just shows a blinking cursor on the screen. I have tried it with many files, but each run ends with a horizontal cursor blinking in the console, and it stays that way even after 5-6 minutes.

        How big are these files?
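
        One thing worth checking first, assuming the script still reads its input via the diamond operator: if you run it without a filename argument, <> falls back to reading STDIN, so perl sits waiting for keyboard input, and all you see is a blinking cursor. Passing the input file on the command line rules that out (the script and path here are placeholders):

        perl dedup.pl "C:\path\to\input.txt"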


        Dave