bedohave9 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, we have a .NET application to be tested in Perl. We are working with .NET 4.0 on the Windows 7 OS. My requirement is to identify the duplicate records in a text file. We have nearly 1000 of these files to process, and the time frame is very tight. A sample line of data from one of the files:

ABC Mutual Funds |00000111|001110407033|011000073|FRF|CG12088886S11086: ABC Mutual Funds|665.26
Here is my code:

#!/usr/bin/perl
use strict;
use warnings;
use File::IO;

my %no_dupes;
while (<>) {
    chomp;
    $no_dupes{$_} = $. if (! exists $no_dupes{$_});
}

$result_file = "C:\Users\spullabhotla\Desktop\TestToRemoveDuplicates.txt";
open (RESULT_FILE, ">$result_file");

foreach my $line (sort {$no_dupes{$a} <=> $no_dupes{$b}} keys %no_dupes) {
    print RESULT_FILE "$line\n";
}
When I executed the script, the console threw this error:

Can't locate File/IO.pm in @INC (@INC contains: C:/Perl64/site/lib C:/Perl64/lib .) at C:\Perl64\Perl Programs\Learning\remit_dup.pl line 5.
BEGIN failed--compilation aborted at C:\Perl64\Perl Programs\Learning\remit_dup.pl line 5.

Please let me know if I need to change anything in the code.

Re: Remove Duplicates from one text file and print them in another text file.
by toolic (Bishop) on Jun 04, 2012 at 18:05 UTC
    Can't locate File/IO.pm in @INC
    That error means that you told perl to load the File::IO module, but it could not be found in the search path. Since your code doesn't seem to rely on that module anyway, try deleting the following line:
    use File::IO;

    Please edit your post to use code tags for your code and error messages: Writeup Formatting Tips

    To clean up other warnings and errors, change:

    $result_file = "C:\Users\spullabhotla\Desktop\TestToRemoveDuplicates.txt";

    to this (adding my, and using single quotes so the backslashes are not treated as escape sequences):

    my $result_file = 'C:\Users\spullabhotla\Desktop\TestToRemoveDuplicates.txt';
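
    Putting those fixes together, a minimal sketch of the corrected script might look like this (the three-arg open with error checking goes beyond what you posted, but it will tell you right away if the output file cannot be created):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Remember the line number of the first occurrence of each line.
    my %no_dupes;
    while (<>) {
        chomp;
        $no_dupes{$_} = $. if !exists $no_dupes{$_};
    }

    # Single quotes keep the backslashes literal.
    my $result_file = 'C:\Users\spullabhotla\Desktop\TestToRemoveDuplicates.txt';
    open my $result_fh, '>', $result_file
        or die "Cannot open $result_file: $!";

    # Print the deduplicated lines in their original order.
    for my $line ( sort { $no_dupes{$a} <=> $no_dupes{$b} } keys %no_dupes ) {
        print $result_fh "$line\n";
    }
    close $result_fh;

    Run it as perl remit_dup.pl input.txt so the diamond operator reads from the named file rather than waiting on STDIN.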

    BTW, there is a Core module named IO::File, if that's what you are trying to use.
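
    And if IO::File is what you were after, the open would look like this instead (just a sketch, reusing the $result_file variable from above):

    use IO::File;

    # Object-oriented filehandle; same effect as a plain open().
    my $fh = IO::File->new($result_file, '>')
        or die "Cannot open $result_file: $!";
    $fh->print("some line\n");
    $fh->close;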

      Thank you, that did work for me. The code executed without errors, but the console is just showing a blinking cursor on the screen. Does that mean there are no duplicates in the file? Please let me know; I am confused by the console behavior. Also, I was not able to find IO::File in the Perl Package Manager. Could I copy the IO::File code into a file, save it in the library with a .pm extension, and have that work?

        Also, I was not able to find IO::File in the Perl Package Manager.
        IO::File is a Core module, which means that it should be available to you. Try this at your command prompt:
        perldoc IO::File
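
        Another quick check from the same prompt, which prints the installed version if the module loads:

        perl -MIO::File -e "print IO::File->VERSION"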
Re: Remove Duplicates from one text file and print them in another text file.
by aaron_baugher (Curate) on Jun 04, 2012 at 18:48 UTC

    Since your time frame is very tight, and presumably this isn't homework that has to be done in Perl, you might want to avoid reinventing the wheel (these utilities are available on any *nix system, including under Cygwin):

    sort file.txt | uniq -d > duplicates.txt
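
    If you would rather not install anything, a roughly equivalent Perl one-liner works from the Windows command prompt as well (file.txt and duplicates.txt are placeholders; unlike uniq -d it does not need sorted input, and it prints each duplicated line once, at its second occurrence):

    perl -ne "print if ++$seen{$_} == 2" file.txt > duplicates.txt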

    Aaron B.
    Available for small or large Perl jobs; see my home node.

      I am working on .NET 4.0 and Windows 7 is my OS. The code threw an error. Probably this is for Unix/Linux OS. Please correct me if I am wrong.

      Backslash found where operator expected at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "Users\"
      Backslash found where operator expected at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "spullabhotla\"
      Backslash found where operator expected at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "Desktop\"
      syntax error at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "sort C:"
      Execution of C:\Perl64\Perl Programs\Learning\remdup1.pl aborted due to compilation errors.

        I am working on .NET 4.0 and Windows 7 is my OS. The code threw an error. Probably this is for Unix/Linux OS. Please correct me if I am wrong.

        Backslash found where operator expected at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "Users\"
        Backslash found where operator expected at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "spullabhotla\"
        Backslash found where operator expected at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "Desktop\"
        syntax error at C:\Programs\CPGD\PerlScripts\remdup1.pl line 4, near "sort C:"
        Execution of C:\Perl64\Perl Programs\Learning\remdup1.pl aborted due to compilation errors.

        Yes, utilities like sort and uniq are available in Unix/Linux operating systems. On Windows, you can install a package called Cygwin, which will give you a Unix-like environment and command line, plus those common utilities.

        Aaron B.
        Available for small or large Perl jobs; see my home node.

Re: Remove Duplicates from one text file and print them in another text file.
by davido (Cardinal) on Jun 04, 2012 at 18:35 UTC

    My requirement is to identify the duplicate records in a text file.

    while (<>) {
        chomp;
        $no_dupes{$_} = $. if (! exists $no_dupes{$_});
    }

    ...and what you're doing is identifying the location of a single duplicate for each line's content. Is it reasonable to ignore the possibility of more than one duplicate of a given line? (It may be fine; I don't know your data set. Or it may be a bug waiting to be discovered.) The code you posted is fine if you only care to know the content of any line that is duplicated somewhere in the file. But it falls short if you care how many times, and where, those duplicates are found.

    To allow for the possibility of flagging more than one duplicate, you might try this instead:

    use strict;
    use warnings;

    my %folded_lines;
    while ( <> ) {
        chomp;
        # Record every line number at which this content appears.
        push @{ $folded_lines{$_} }, $.;
    }

    # Keep only content seen more than once, ordered by first occurrence.
    my @dupes = sort { $folded_lines{$a}[0] <=> $folded_lines{$b}[0] }
                grep { @{ $folded_lines{$_} } > 1 }
                keys %folded_lines;

    # RESULT_FILE must already be opened, as in your original script.
    print RESULT_FILE $_, ': ', join( ', ', @{ $folded_lines{$_} } ), "\n"
        for @dupes;

    Every unique line of the file becomes a hash key. Lines that are repeated are detected, and every line number at which each one occurs is listed. If you want to omit the first occurrence from each list, that is simple as well. It would look like this:

    foreach ( @dupes ) {
        # Slice from the second element onward, skipping the first occurrence.
        print RESULT_FILE $_, ': ',
            join( ', ', @{ $folded_lines{$_} }[ 1 .. $#{ $folded_lines{$_} } ] ),
            "\n";
    }

    Sample output might look like this:

    don't panic: 34, 55, 89, 144

    Dave

      The code executed without errors, but the console just shows a blinking cursor on the screen. I have tried it with many files, but each run ends with a horizontal cursor blinking in the console, and it stays that way even after 5-6 minutes.

        How big are these files?
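
        One thing worth checking first, assuming the script still reads its input via the diamond operator: if you run it without a filename argument, <> falls back to reading STDIN, so perl sits waiting for keyboard input, and all you see is a blinking cursor. Passing the input file on the command line rules that out (the script and path here are placeholders):

        perl dedup.pl "C:\path\to\input.txt"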


        Dave