sjd6 has asked for the wisdom of the Perl Monks concerning the following question:
I have been struggling to produce a solution to the following problem for some time (as ive only been using perl for 3 wks now). I was given some great help from BrowserUK (thanks BrowserUK), while i understand most of the code, I cannot fully understand the regex that have been used, which presents the next problem in that i cannot adapt the code to new requirements that have been given.
Here is the original problem:
"P00001" "" "LEFT01" RIGHT01" "Notes:" "" "l1:r1 l2: +r2 l3:r3 l4:r4" "M00001" "" "LEFT23" RIGHT23" "" "" "l1:r1 l2:r2 +" "A00001" "" "LEFT14" RIGHT14" "" "" "l1:r1 l2:r2 + l3:r3 l4:r4 l5:r5" "100001" "" "LEFT01" RIGHT01" "" "" "l1:r1 l2:r2 + l3:r3"
While the code does everything i asked for, i now need to adapt the code in the following way:
If a duplicate of a ‘P’ alias is found then the part number is extracted from the new record and placed inside the notes field of the record it is a duplicate of & a note of this change is made in a log file (as it does now).
If a duplicate of an ‘M’ alias is discovered, the fact is documented in the log file, but no change is made to the approved file at all.
If a duplicate of an ‘A’ alias is established, the part number is extracted from the new record and placed inside the notes field of the record it is a duplicate of & the ‘A’ alias becomes a ‘P’ alias ('A' is replaced by 'P', not a copy of the record). A note of this change is also made in the log file.
Heres the code so far:
#! perl -slw use strict; use Tie::File; my $filename = "log_duplicate.txt"; open(LOG,"+>>$filename") || die "Can't open file $filename"; # open de +stination text file # Initialise the filenames here my( $old, $new ) = ( 'approved.txt', 'unapproved.txt' ); # Comment this out, once you specify your real files above!! # genTestData( $old, $new ); # Use the files as arrays. See Tie::File. tie my @old, 'Tie::File', $old; tie my @new, 'Tie::File', $new; # Build a lookup table into the old array (file) # keyed by the catenation of fields 2, 3 & 6; my %old; $old{ join $;, ( split "\t", $old[ $_ ] )[ 2, 3, 6 ] } = $_ for 0 .. $ +#old; # Remove duplicates from the new file, if any. my %seen; @new = map{ ++$seen{ join( $;, (split '\t' )[ 2, 3 ,6 ] ) } == 1 ? $_ : () } @new; # Now process the new file line by line for my $lineno ( 0 .. $#new ) { # Split the TSV data into an array. my @fields = split "\t", $new[ $lineno ]; # And strip the quotes from the partno for later. $fields[0] =~ s["([^\x22]*)"][$1]; # Catenate the 3 key fields and do a lookup in the old data table. if( exists $old{ join $;, @fields[ 2, 3, 6 ] } ) { # If it exists, edit the line if the old file $old[ $old{ join $;, @fields[ 2, 3, 6 ] } ] # locate the notes field =~ s[ ("[^\x22]*") (?= (?: \t [^\t]*? ){2}$ ) ] { # Make a modifiable copy my $notes = $1; ## if alias is p # and append the partno to it $notes =~ s[(?<=")(.*)(?=")][$1,$fields[0]]; # And return the modified field for substituition # into the old file record. $notes; }xe; print LOG "Updating old line ", $old[ $old{ join $;, @fields[ +2, 3, 6 ] } ]; } # Else append the new record to the old file # prefixed with an 'A' else { push @old, "\"A$fields[0]\"\t$fields[1]\t$fields[2]\t$fields[3 +]\t$fields[4]\t$fields[5]\t$fields[6]"; print LOG "Adding new line '", $new[ $lineno ], "' to old file +"; } } exit(0); # This updates the files to disk and closes them. # Everything from here is for generating test data. sub genTestData { my( $old, $new ) = @_; srand( 1); open OLD, '>', $old or die $!; print OLD genLine() for 1 .. 100; close OLD; open NEW, '>', $new or die $!; print NEW genLine() for 1 .. 20; close NEW; } sub genLine{ join "\t", map{ '"' . $_ . '"' } 10000 + int rand 90000, 'dummy', ('l','r')[ rand() < 0.5 ] . int rand(9), ('l','r')[ rand() < 0.5 ] . int rand(9), 'notes', 'dummy', ''; } __END__ P:\test>295919.pl8 Adding new line '"13471" "dummy" "r1" "r1" "notes" "dummy +" ""' to old file Adding new line '"65827" "dummy" "l8" "r7" "notes" "dummy +" ""' to old file Updating old line "31648" "dummy" "l8" "r8" "notes:68098" + "dummy" "" Adding new line '"69773" "dummy" "r5" "l3" "notes" "dummy +" ""' to old file Adding new line '"94869" "dummy" "l5" "l2" "notes" "dummy +" ""' to old file Adding new line '"45724" "dummy" "r0" "l1" "notes" "dummy +" ""' to old file Updating old line "97885" "dummy" "r5" "r1" "notes:16325" + "dummy" "" Updating old line "95152" "dummy" "l4" "l6" "notes:24029" + "dummy" "" Adding new line '"49715" "dummy" "l3" "l5" "notes" "dummy +" ""' to old file Adding new line '"27962" "dummy" "l7" "r8" "notes" "dummy +" ""' to old file Adding new line '"26677" "dummy" "l5" "r5" "notes" "dummy +" ""' to old file Adding new line '"73764" "dummy" "r2" "r3" "notes" "dummy +" ""' to old file Updating old line "90568" "dummy" "l6" "l3" "notes:90576" + "dummy" "" Adding new line '"45765" "dummy" "l2" "r5" "notes" "dummy +" ""' to old file Updating old line "75975" "dummy" "l6" "l6" "notes:41819" + "dummy" "" Adding new line '"22538" "dummy" "r2" "l8" "notes" "dummy +" ""' to old file Adding new line '"43104" "dummy" "l0" "l1" "notes" "dummy +" ""' to old file Adding new line '"56614" "dummy" "l3" "l0" "notes" "dummy +" ""' to old file Adding new line '"17160" "dummy" "r0" "r2" "notes" "dummy +" ""' to old file Adding new line '"72753" "dummy" "r3" "r6" "notes" "dummy +" ""' to old file
Thanks so much for all the help
Edit by tye, convert UTF-8 to Latin-1
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Compare files & remove duplicate entries based on criteria
by Roger (Parson) on Oct 08, 2003 at 06:28 UTC | |
by Anonymous Monk on Oct 08, 2003 at 08:31 UTC |