modifying the remove duplicates code

aish has asked for the wisdom of the Perl Monks concerning the following question:

I have written a simple script that removes duplicate lines from a CSV file. Now I want to modify it so that only certain fields are considered when checking for duplicates. For example, my CSV has the fields file_name, id, user, group, permissions, links, path, date, file_size, etc., and I want only file_name, id, user, date and file_size to be compared. How do I go about this? I also want to write the duplicate records to another file.

#!/usr/bin/perl

use strict;
use warnings;

# Set to filename of CSV file
my $csvfile = '';

# Set to filename of new file (file without duplicates)
my $newfile = '';

# Set to 1 if first line of CSV file contains field names, 0 otherwise
my $fieldnames = 1;

### Shouldn't need to change stuff below here ###

# Three-argument open keeps the mode separate from the filename
open(IN, '<', $csvfile) or die "Couldn't open input CSV file '$csvfile': $!";
open(OUT, '>', $newfile) or die "Couldn't open output file '$newfile': $!";

# Read header lines if they exist
my $header;
$header = <IN> if $fieldnames;

# Slurp in & sort everything else

my @data = sort <IN>;

# If we read in a header line, throw it back out again
print OUT $header if $fieldnames;

my $n = 0;
# Now go through the data line by line, writing it to output unless it's identical
# to the previous line (in which case it's a dupe)
my $lastline = '';
foreach my $currentline (@data) {
  next if $currentline eq $lastline;
  print OUT $currentline;
  $lastline = $currentline;

  $n++;
}

close IN;
close OUT;

print "Processing complete. In = " . scalar(@data) . " records, Out = $n records\n";

Replies are listed 'Best First'.
Re: modifying the remove duplicates code by toolic (Bishop) on May 03, 2011 at 13:01 UTC
Split the lines on commas into an array, then compare only the fields you're interested in:

```perl
use warnings;
use strict;

my @currs;
my @lasts;
while (<DATA>) {
    # Keep only file_name, id, user, date, file_size
    @currs = (split /,/)[0..2, 7..8];

    # The first record has nothing to compare against, so it is never a dupe
    my $same = 0;
    if (@lasts) {
        $same = 1;
        for (0 .. $#currs) {
            $same = 0 if $currs[$_] ne $lasts[$_];
        }
    }
    print unless $same;
    @lasts = @currs;
}

#0        ,1 ,2   ,3    ,4          ,5    ,6   ,7   ,8
#file_name,id,user,group,permissions,links,path,date,file_size
#file_name,id,user,                            date,file_size
__DATA__
foo,123,me,us,777,golf,/home/me,5/3/11,100
foo,123,me,them,666,four,/home/you,5/3/11,100
```
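To also send the duplicate records somewhere else, one possible variation (a minimal sketch, not a definitive answer) is to build a key from the chosen fields and remember it in a `%seen` hash. This swaps the previous-line comparison for a hash lookup, so the input does not have to be sorted. The names `split_duplicates` and `@KEY_FIELDS` are made up for illustration, the third sample row is invented, and the plain `split` on commas assumes that no field contains a quoted comma.

```perl
use strict;
use warnings;

# Assumed positions of file_name, id, user, date, file_size
my @KEY_FIELDS = (0, 1, 2, 7, 8);

# Hypothetical helper: returns (\@unique, \@dupes).
# The first record seen for each key is kept; later ones count as dupes.
sub split_duplicates {
    my ($lines, $key_fields) = @_;
    my (%seen, @unique, @dupes);
    for my $line (@$lines) {
        (my $record = $line) =~ s/\r?\n\z//;    # strip the line ending
        my @fields = split /,/, $record, -1;    # naive split, no quoted commas
        my $key = join "\0",
            map { defined $_ ? $_ : '' } @fields[@$key_fields];
        if ($seen{$key}++) {
            push @dupes, $line;
        }
        else {
            push @unique, $line;
        }
    }
    return (\@unique, \@dupes);
}

# First two rows are the sample data from the reply; the third is invented
my @lines = (
    "foo,123,me,us,777,golf,/home/me,5/3/11,100\n",
    "foo,123,me,them,666,four,/home/you,5/3/11,100\n",
    "bar,456,you,us,644,one,/home/you,5/4/11,200\n",
);

my ($unique, $dupes) = split_duplicates(\@lines, \@KEY_FIELDS);
print "unique: $_" for @$unique;
print "dupe:   $_" for @$dupes;
```

In a real script the two lists would be printed to two different output filehandles instead of STDOUT. If the data can contain quoted fields, a CSV parser such as Text::CSV is likely a safer choice than `split`.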
Re^2: modifying the remove duplicates code by aish (Initiate) on May 04, 2011 at 09:50 UTC
I did get output. How do I split it?
Re: modifying the remove duplicates code by locked_user sundialsvc4 (Abbot) on May 03, 2011 at 12:32 UTC
As you seem to be already doing, `sort` the file on the key(s) that you wish to compare, then compare each record to “the previous record” as before. The `sort` verb allows you to provide a record-comparison function so that you can sort on multiple fields. See also: Text::Record::Deduper. Yup, there’s a CPAN module for everything.
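The sort-then-compare idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `key_fields`, `by_key_fields` and `dedupe_sorted` are hypothetical names, the field positions are taken from the question, the third sample row is invented, and fields are assumed not to contain quoted commas.

```perl
use strict;
use warnings;

# Assumed positions of file_name, id, user, date, file_size
my @KEY_FIELDS = (0, 1, 2, 7, 8);

# Return only the fields that take part in the comparison
sub key_fields {
    my ($line) = @_;
    (my $record = $line) =~ s/\r?\n\z//;
    my @fields = split /,/, $record, -1;
    return map { defined $_ ? $_ : '' } @fields[@KEY_FIELDS];
}

# Comparison function for sort: compare the key fields left to right
sub by_key_fields {
    my @x = key_fields($a);
    my @y = key_fields($b);
    for my $i (0 .. $#x) {
        my $cmp = $x[$i] cmp $y[$i];
        return $cmp if $cmp;
    }
    return 0;
}

# Sort on the key fields, then compare each record's key to the previous one
sub dedupe_sorted {
    my @sorted = sort by_key_fields @_;
    my (@unique, @dupes, $prev);
    for my $line (@sorted) {
        my $key = join "\0", key_fields($line);
        if (defined $prev && $key eq $prev) {
            push @dupes, $line;
        }
        else {
            push @unique, $line;
            $prev = $key;
        }
    }
    return (\@unique, \@dupes);
}

my @lines = (
    "foo,123,me,us,777,golf,/home/me,5/3/11,100\n",
    "bar,456,you,us,644,one,/home/you,5/4/11,200\n",
    "foo,123,me,them,666,four,/home/you,5/3/11,100\n",
);

my ($unique, $dupes) = dedupe_sorted(@lines);
print "unique: $_" for @$unique;
print "dupe:   $_" for @$dupes;
```

Note that which record of a duplicate group ends up in the unique list depends on the sort order, so this suits cases where any one representative will do.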