aish has asked for the wisdom of the Perl Monks concerning the following question:

I have written some simple code for removing duplicates from a CSV file. Now I want to modify it so that only certain fields are checked for duplicates. The CSV has fields such as file_name, id, user, group, permissions, links, path, date, file_size, and I want only file_name, id, user, date and file_size to be considered when checking for duplicates. How do I go about this? I also want to output the duplicate records to another file.
#!/usr/bin/perl -w
use strict;

# Set to filename of CSV file
my $csvfile = '';

# Set to filename of new file (file without duplicates)
my $newfile = '';

# Set to 1 if first line of CSV file contains field names, 0 otherwise
my $fieldnames = 1;

### Shouldn't need to change stuff below here ###

open (IN, "<$csvfile") or die "Couldn't open input CSV file: $!";
open (OUT, ">$newfile") or die "Couldn't open output file: $!";

# Read header lines if they exist
my $header;
$header = <IN> if $fieldnames;

# Slurp in & sort everything else
my @data = sort <IN>;

# If we read in a header line, throw it back out again
print OUT $header if defined $header;

my $n = 0;

# Now go through the data line by line, writing it to output unless it's
# identical to the previous line (in which case it's a dupe)
my $lastline = '';
foreach my $currentline (@data) {
    next if $currentline eq $lastline;
    print OUT $currentline;
    $lastline = $currentline;
    $n++;
}

close IN;
close OUT;

print "Processing complete. In = " . scalar @data . " records, Out = $n records\n";

Re: modifying the remove duplicates code
by toolic (Bishop) on May 03, 2011 at 13:01 UTC
    Split the lines on commas into an array, then compare only the fields you're interested in:

    use warnings;
    use strict;

    my @currs;
    my @lasts;
    while (<DATA>) {
        # Keep only file_name, id, user, date, file_size
        @currs = (split /,/)[0..2, 7..8];

        # The first record has nothing to compare against, so it is never a dupe
        my $same = @lasts ? 1 : 0;
        if (@lasts) {
            for (0 .. $#currs) {
                $same = 0 if $currs[$_] ne $lasts[$_];
            }
        }
        print unless $same;
        @lasts = @currs;
    }

    #0        ,1 ,2   ,3    ,4          ,5    ,6   ,7   ,8
    #file_name,id,user,group,permissions,links,path,date,file_size
    #file_name,id,user,                             date,file_size

    __DATA__
    foo,123,me,us,777,golf,/home/me,5/3/11,100
    foo,123,me,them,666,four,/home/you,5/3/11,100
      I did get an output. How do I split it?
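The question also asks for the duplicate records to be kept separately. Here is a minimal sketch of one way to do that. It swaps the previous-line comparison for a hash of keys already seen, so the input doesn't need to be sorted. The helper name split_duplicates and the column positions in @KEY_COLS are assumptions for illustration, based on the field order given in the question. A plain split on commas is also assumed to be good enough, which it won't be if fields can contain quoted commas (Text::CSV handles that case).

```perl
use strict;
use warnings;

# Assumed column positions of file_name, id, user, date, file_size,
# taken from the field order listed in the question.
my @KEY_COLS = (0, 1, 2, 7, 8);

# Hypothetical helper: returns two array refs, the first-seen records
# and the records whose key fields repeat an earlier record.
sub split_duplicates {
    my @lines = @_;
    my (%seen, @unique, @dupes);
    for my $line (@lines) {
        chomp(my $copy = $line);
        my @fields = split /,/, $copy, -1;   # -1 keeps trailing empty fields
        # Join the key fields with a separator unlikely to occur in the data
        my $key = join "\x1F", map { $_ // '' } @fields[@KEY_COLS];
        if ($seen{$key}++) {
            push @dupes, $line;
        }
        else {
            push @unique, $line;
        }
    }
    return (\@unique, \@dupes);
}

my @sample = (
    "foo,123,me,us,777,golf,/home/me,5/3/11,100\n",
    "foo,123,me,them,666,four,/home/you,5/3/11,100\n",
    "bar,456,you,us,644,one,/home/you,5/4/11,200\n",
);

my ($unique, $dupes) = split_duplicates(@sample);
print "unique: $_" for @$unique;
print "dupe:   $_" for @$dupes;
```

To write to files instead of the screen, open two output handles and print @$unique to one and @$dupes to the other.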
Re: modifying the remove duplicates code
by locked_user sundialsvc4 (Abbot) on May 03, 2011 at 12:32 UTC

    As you seem to be already doing, sort the file on the key(s) that you wish to compare, then compare each record to “the previous record” as before. The sort function allows you to provide a record-comparison function so that you can sort on multiple fields.
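A rough sketch of that idea: sort with a comparison function that looks only at the key fields, then compare each record with the previous one using the same function. The names key_fields and compare_keys and the column positions in @KEY_COLS are assumptions for illustration, not part of the original script, and fields are assumed to contain no quoted commas.

```perl
use strict;
use warnings;

# Assumed column positions of file_name, id, user, date, file_size.
my @KEY_COLS = (0, 1, 2, 7, 8);

# Hypothetical helper: return the key fields of one CSV line.
sub key_fields {
    my ($line) = @_;
    chomp $line;
    return (split /,/, $line, -1)[@KEY_COLS];
}

# Hypothetical comparison function: compare key fields left to right
# as strings and stop at the first field that differs.
sub compare_keys {
    my ($left, $right) = @_;
    my @x = key_fields($left);
    my @y = key_fields($right);
    for my $i (0 .. $#KEY_COLS) {
        my $cmp = ($x[$i] // '') cmp ($y[$i] // '');
        return $cmp if $cmp;
    }
    return 0;
}

my @sample = (
    "foo,123,me,us,777,golf,/home/me,5/3/11,100\n",
    "bar,456,you,us,644,one,/home/you,5/4/11,200\n",
    "foo,123,me,them,666,four,/home/you,5/3/11,100\n",
);

# Sort on the key fields only, so matching records end up adjacent
my @sorted = sort { compare_keys($a, $b) } @sample;

my $prev;
for my $line (@sorted) {
    if (defined $prev && compare_keys($prev, $line) == 0) {
        print "dupe:   $line";
    }
    else {
        print "unique: $line";
    }
    $prev = $line;
}
```

The string comparison does not put dates in chronological order, but that doesn't matter here: the sort only has to bring records with equal keys next to each other.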

    See also: Text::Record::Deduper. Yup, there’s a CPAN module for everything.