sjd6 has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks, hopefully your wisdom & superior knowledge can shed some light on the problem I have...

I have been struggling to produce a solution to the following problem for some time (I've only been using Perl for 3 weeks now). I was given some great help from BrowserUK (thanks BrowserUK), and while I understand most of the code, I cannot fully understand the regexes that have been used, which presents the next problem: I cannot adapt the code to the new requirements I have been given.

Here is the original problem:


A TSV (tab-separated) file contains data produced from testing cables in custom-built hardware. This data consists of 7 fields: alias, empty string, left connector, right connector, notes, empty string, ping.
"P00001" "" "LEFT01" RIGHT01" "Notes:" "" "l1:r1 l2: +r2 l3:r3 l4:r4" "M00001" "" "LEFT23" RIGHT23" "" "" "l1:r1 l2:r2 +" "A00001" "" "LEFT14" RIGHT14" "" "" "l1:r1 l2:r2 + l3:r3 l4:r4 l5:r5" "100001" "" "LEFT01" RIGHT01" "" "" "l1:r1 l2:r2 + l3:r3"

New test data is added, and the data that existed before is identified and split into separate files for comparison: approved & unapproved. This is done on the assumption that no new test data starts with a ‘P’, ‘M’ or ‘A’ alias (the code below doesn't cater for this at the moment, but I have a solution that does the job in a separate script).
Each line in the unapproved file is compared to every line in the approved file to check for duplicate data. Three fields are evaluated in the comparison: left connector, right connector and ping.
If all three fields match a record (line) in the approved file, this record is a duplicate.
If a duplicate is found, then the part number (alias) is extracted from the record (in the unapproved file) and placed inside the notes field of the record it is a duplicate of (in the approved file), & a note of this change is made in a log file.
If all three fields do not match a record then the new record is not a duplicate and therefore must be approved. This is done simply by adding an ‘A’ in front of the part number (contained in the alias field). Again a note of this should be made in the log file.
The final output will be in the approved file, with all changes that have occurred recorded in the log file.
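
In code terms, the duplicate check comes down to building a single lookup key from fields 2, 3 and 6 of each record, which is the idea the full code below uses. A stripped-down sketch of just that idea, with one hard-coded approved record and one made-up new record standing in for the real files:

use strict;
use warnings;

# One hypothetical approved record and one hypothetical new record,
# hard-coded here instead of being read from the tied files.
my @approved = (
    qq{"P00001"\t""\t"LEFT01"\t"RIGHT01"\t"Notes:"\t""\t"l1:r1 l2:r2"},
);
my $newrec = qq{"N00002"\t""\t"LEFT01"\t"RIGHT01"\t""\t""\t"l1:r1 l2:r2"};

# Key each approved record by left connector, right connector and ping
# (fields 2, 3 and 6), joined with Perl's subscript separator $; .
my %old;
$old{ join $;, ( split /\t/, $approved[$_] )[ 2, 3, 6 ] } = $_ for 0 .. $#approved;

# A new record is a duplicate if its key is already in the table.
my $key = join $;, ( split /\t/, $newrec )[ 2, 3, 6 ];
print exists $old{$key} ? "duplicate of approved line $old{$key}\n"
                        : "not a duplicate\n";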

While the code does everything I asked for, I now need to adapt it in the following way:
If a duplicate of a ‘P’ alias is found then the part number is extracted from the new record and placed inside the notes field of the record it is a duplicate of & a note of this change is made in a log file (as it does now).
If a duplicate of an ‘M’ alias is discovered, the fact is documented in the log file, but no change is made to the approved file at all.
If a duplicate of an ‘A’ alias is established, the part number is extracted from the new record and placed inside the notes field of the record it is a duplicate of & the ‘A’ alias becomes a ‘P’ alias ('A' is replaced by 'P', not a copy of the record). A note of this change is also made in the log file.
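
One way the duplicate branch could be adapted is to look at the first character of the alias in the matched approved record and branch on it. A rough sketch of just that branching (everything here is a placeholder: $oldline stands in for the matched record in @old, $partno for the new record's part number, and the prints for the notes/log updates done in the full code):

use strict;
use warnings;

# Hypothetical placeholders for this sketch only: in the real script,
# $oldline would be the matched record in @old and $partno would come
# from $fields[0] of the new record.
my $oldline = qq{"A00001"\t""\t"LEFT14"\t"RIGHT14"\t""\t""\t"l1:r1 l2:r2"};
my $partno  = '100001';

my ( $old_alias ) = $oldline =~ /^"([^"]*)"/;   # first quoted field

if ( $old_alias =~ /^P/ ) {
    # 'P' duplicate: append the part number to the notes field and log
    # it, exactly as the current code does.
    print "P duplicate: add $partno to notes of $old_alias\n";
}
elsif ( $old_alias =~ /^M/ ) {
    # 'M' duplicate: only log it; leave the approved file untouched.
    print "M duplicate of $old_alias: no change to approved file\n";
}
elsif ( $old_alias =~ /^A/ ) {
    # 'A' duplicate: append the part number to the notes field (as for
    # 'P') and turn the leading 'A' of the alias into a 'P' in place.
    $oldline =~ s/^"A/"P/;
    print "A duplicate: alias now starts with P, add $partno to notes\n";
}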

Here's the code so far:

#! perl -slw
use strict;
use Tie::File;

my $filename = "log_duplicate.txt";
open(LOG,"+>>$filename") || die "Can't open file $filename"; # open destination text file

# Initialise the filenames here
my( $old, $new ) = ( 'approved.txt', 'unapproved.txt' );

# Comment this out, once you specify your real files above!!
# genTestData( $old, $new );

# Use the files as arrays. See Tie::File.
tie my @old, 'Tie::File', $old;
tie my @new, 'Tie::File', $new;

# Build a lookup table into the old array (file)
# keyed by the catenation of fields 2, 3 & 6;
my %old;
$old{ join $;, ( split "\t", $old[ $_ ] )[ 2, 3, 6 ] } = $_ for 0 .. $#old;

# Remove duplicates from the new file, if any.
my %seen;
@new = map{ ++$seen{ join( $;, (split '\t')[ 2, 3, 6 ] ) } == 1 ? $_ : () } @new;

# Now process the new file line by line
for my $lineno ( 0 .. $#new ) {
    # Split the TSV data into an array.
    my @fields = split "\t", $new[ $lineno ];

    # And strip the quotes from the partno for later.
    $fields[0] =~ s["([^\x22]*)"][$1];

    # Catenate the 3 key fields and do a lookup in the old data table.
    if( exists $old{ join $;, @fields[ 2, 3, 6 ] } ) {
        # If it exists, edit the line in the old file
        $old[ $old{ join $;, @fields[ 2, 3, 6 ] } ]
            # locate the notes field
            =~ s[ ("[^\x22]*") (?= (?: \t [^\t]*? ){2}$ ) ]
                {
                    # Make a modifiable copy
                    my $notes = $1;
                    ## if alias is p
                    # and append the partno to it
                    $notes =~ s[(?<=")(.*)(?=")][$1,$fields[0]];
                    # And return the modified field for substitution
                    # into the old file record.
                    $notes;
                }xe;

        print LOG "Updating old line ", $old[ $old{ join $;, @fields[ 2, 3, 6 ] } ];
    }
    # Else append the new record to the old file
    # prefixed with an 'A'
    else {
        push @old, "\"A$fields[0]\"\t$fields[1]\t$fields[2]\t$fields[3]\t$fields[4]\t$fields[5]\t$fields[6]";
        print LOG "Adding new line '", $new[ $lineno ], "' to old file";
    }
}

exit(0); # This updates the files to disk and closes them.

# Everything from here is for generating test data.
sub genTestData {
    my( $old, $new ) = @_;
    srand( 1 );

    open OLD, '>', $old or die $!;
    print OLD genLine() for 1 .. 100;
    close OLD;

    open NEW, '>', $new or die $!;
    print NEW genLine() for 1 .. 20;
    close NEW;
}

sub genLine{
    join "\t", map{ '"' . $_ . '"' }
        10000 + int rand 90000,
        'dummy',
        ('l','r')[ rand() < 0.5 ] . int rand(9),
        ('l','r')[ rand() < 0.5 ] . int rand(9),
        'notes',
        'dummy',
        '';
}

__END__
P:\test>295919.pl8
Adding new line '"13471" "dummy" "r1" "r1" "notes" "dummy" ""' to old file
Adding new line '"65827" "dummy" "l8" "r7" "notes" "dummy" ""' to old file
Updating old line "31648" "dummy" "l8" "r8" "notes:68098" "dummy" ""
Adding new line '"69773" "dummy" "r5" "l3" "notes" "dummy" ""' to old file
Adding new line '"94869" "dummy" "l5" "l2" "notes" "dummy" ""' to old file
Adding new line '"45724" "dummy" "r0" "l1" "notes" "dummy" ""' to old file
Updating old line "97885" "dummy" "r5" "r1" "notes:16325" "dummy" ""
Updating old line "95152" "dummy" "l4" "l6" "notes:24029" "dummy" ""
Adding new line '"49715" "dummy" "l3" "l5" "notes" "dummy" ""' to old file
Adding new line '"27962" "dummy" "l7" "r8" "notes" "dummy" ""' to old file
Adding new line '"26677" "dummy" "l5" "r5" "notes" "dummy" ""' to old file
Adding new line '"73764" "dummy" "r2" "r3" "notes" "dummy" ""' to old file
Updating old line "90568" "dummy" "l6" "l3" "notes:90576" "dummy" ""
Adding new line '"45765" "dummy" "l2" "r5" "notes" "dummy" ""' to old file
Updating old line "75975" "dummy" "l6" "l6" "notes:41819" "dummy" ""
Adding new line '"22538" "dummy" "r2" "l8" "notes" "dummy" ""' to old file
Adding new line '"43104" "dummy" "l0" "l1" "notes" "dummy" ""' to old file
Adding new line '"56614" "dummy" "l3" "l0" "notes" "dummy" ""' to old file
Adding new line '"17160" "dummy" "r0" "r2" "notes" "dummy" ""' to old file
Adding new line '"72753" "dummy" "r3" "r6" "notes" "dummy" ""' to old file

Thanks so much for all the help

Edit by tye, convert UTF-8 to Latin-1

Re: Compare files & remove duplicate entries based on criteria
by Roger (Parson) on Oct 08, 2003 at 06:28 UTC
    Have you thought about doing this with a database approach using DBD::CSV yet? I have posted a working example of using the DBD::CSV driver here. Just thought you might be interested to have a look at this technique.
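
    For illustration only, a minimal sketch of the DBD::CSV approach (not the code from that node): the table mapping and the column names (alias, left_conn, right_conn, ping) are invented for the example, and it assumes the data file carries a header line naming those columns, which the files described above do not have. The csv_sep_char attribute is what lets the driver read tab-separated data.

    use strict;
    use warnings;
    use DBI;

    # Sketch only: file, table and column names here are made up.
    my $dbh = DBI->connect( 'dbi:CSV:', undef, undef, {
        f_dir        => '.',      # directory containing the data files
        csv_sep_char => "\t",     # treat the files as tab-separated
        RaiseError   => 1,
    } ) or die $DBI::errstr;

    # Map a table name onto the approved file.
    $dbh->{csv_tables}{approved} = { file => 'approved.txt' };

    # Look for records with a given left/right connector pair.
    my $sth = $dbh->prepare(
        'SELECT alias, ping FROM approved WHERE left_conn = ? AND right_conn = ?'
    );
    $sth->execute( 'LEFT01', 'RIGHT01' );
    while ( my ( $alias, $ping ) = $sth->fetchrow_array ) {
        print "$alias => $ping\n";
    }

    If the files have no header line, DBD::CSV's per-table col_names attribute can supply the column names instead.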

      Will it work for a TSV file, or would I have to convert the TSV file into a CSV file?