Dave_PA has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks!

Here is the scenario of the file that I have to work with. Every week I get a copy of seven files, one for each day of the week. What I need to do is combine the files into one file, keeping them in order (the oldest file first, the newest file last). I got this part working. The next piece is to search for duplicate records, but with a twist.

For lack of a better term, the 'primary key' for a record is in characters 9-13 of the row. So if any information is updated beyond the 13th character, the record needs to be updated. But wait, it gets better. Say an update was made to a record on Monday, then again on Friday. When the files are combined, I need to keep only the newest version, which would be the record inserted on Friday.

An example would be this:

This would be on line 10, so it would be from earlier in the week:

542642  19779   SAMMYs  17TH ST

On line 1500 this would be listed:
542642  19779   SAMMYs  Sesame ST

So what I would like is the SAMMYs at 17TH ST gone, keeping only the listing at Sesame ST. Also, the 19779 is what lets you know that it's the same store.
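In Perl, a fixed-position key like that can be pulled out with substr. A small sketch using the sample records above (note that characters 9-13, counted from 1, are offset 8, length 5 in substr's zero-based terms):

```perl
use strict;
use warnings;

# The two sample records from above; the store id "19779"
# occupies characters 9-13 of each row.
my @records = (
    "542642  19779   SAMMYs  17TH ST",
    "542642  19779   SAMMYs  Sesame ST",
);

for my $record (@records) {
    my $key = substr($record, 8, 5);   # offset 8 == character 9, 1-based
    print "$key\n";                    # prints 19779 for both records
}
```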

So here is where I’m at now. I searched through previous monk posts and found some really good stuff on finding and removing duplicate elements in an array.

http://www.perlmonks.org/?node_id=280484

Which got me to

http://perldoc.perl.org/perlfaq4.html#How-can-I-remove-duplicate-elements-from-a-list-or-array%3F

So here is what I did: I read the file into an array, reversed the array (so that instead of oldest first it was newest first), then did the duplicate search. Then I re-reversed the array, putting the oldest first again.

open(my $fh, '<', 'file.txt') or die "can't open file.txt: $!";
my @FileInfo = <$fh>;
close($fh);

my @newFile = reverse @FileInfo;              # newest first
my %seen = ();
my @unique = grep { !$seen{$_}++ } @newFile;
@newFile = reverse @unique;                   # oldest first again

open($fh, '>', 'file.txt') or die "can't open file.txt for writing: $!";
print $fh @newFile;
close($fh);


My problem is that I can't find a good example that does a hash/grep based on a 'primary key'.

I think I'm close, but I really need help from monks more well-versed than I to make this truly work right.

thanks!
Dave

Replies are listed 'Best First'.
Re: how to remove similiar duplicate elements in a file/array
by ikegami (Patriarch) on Dec 04, 2008 at 17:01 UTC

    You want to check if you've seen the key (not the line), so the key is what you should be putting into the hash. For some definition of extract_key,

    my @unique = grep { my $key = extract_key($_); ! $seen{ $key }++ } @newFile;

    In this case,

    my @unique = grep !$seen{ substr($_, 8, 5) }++, @newFile;  # offset 8 == character 9, 1-based
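    Putting that keyed grep together with the reverse-then-re-reverse idea from the question (the records are the made-up samples from above, plus one unrelated store; the key is assumed to be characters 9-13, i.e. offset 8):

```perl
use strict;
use warnings;

my @FileInfo = (
    "542642  19779   SAMMYs  17TH ST\n",     # older record
    "111111  20001   BERTs   OAK AVE\n",
    "542642  19779   SAMMYs  Sesame ST\n",   # newer record, same key 19779
);

# Newest first, so the first record seen for each key is the one kept.
my @newFile = reverse @FileInfo;
my %seen;
my @unique = grep !$seen{ substr($_, 8, 5) }++, @newFile;

# Back to oldest first.
@newFile = reverse @unique;
print @newFile;   # the BERTs line, then the Sesame ST line
```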
Re: how to remove similiar duplicate elements in a file/array
by Limbic~Region (Chancellor) on Dec 04, 2008 at 20:10 UTC
    Dave_PA,
    Here is how I would do it as a general solution which is a balance of reduced complexity, IO, and memory consumption (untested).
    #!/usr/bin/perl
    use strict;
    use warnings;

    use File::ReadBackwards;

    my @input = qw/sun.txt mon.txt tue.txt wed.txt thu.txt fri.txt sat.txt/;

    my %seen;
    open(my $rev_out_fh, '>', 'output.rev')
        or die "Unable to open 'output.rev' for writing: $!";

    # Walk the files newest-first, and each file back-to-front, so the
    # first record seen for a given id is the newest one.
    for my $file (reverse @input) {
        my $bw = File::ReadBackwards->new($file) or die "can't read '$file' $!";
        while (defined(my $line = $bw->readline)) {
            my $id = substr($line, 8, 5);
            next if $seen{$id}++;
            print $rev_out_fh $line;
        }
    }
    close($rev_out_fh);

    # Reverse the intermediate file to restore oldest-first order.
    open(my $out_fh, '>', 'output.txt')
        or die "Unable to open 'output.txt' for writing: $!";
    my $bw = File::ReadBackwards->new('output.rev')
        or die "can't read 'output.rev' $!";
    while (defined(my $line = $bw->readline)) {
        print $out_fh $line;
    }
    close($out_fh);
    unlink 'output.rev'; # you may care if this fails
    If everything fits into memory then this is probably unnecessary.
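    If it does fit, one hedged in-memory sketch (same made-up sample records and the same offset-8 key as above) is a single forward pass where later records overwrite earlier ones in a hash, with a side array remembering each store's first-seen position:

```perl
use strict;
use warnings;

my @lines = (
    "542642  19779   SAMMYs  17TH ST\n",
    "111111  20001   BERTs   OAK AVE\n",
    "542642  19779   SAMMYs  Sesame ST\n",
);

my (%latest, @order);
for my $line (@lines) {
    my $id = substr($line, 8, 5);
    push @order, $id unless exists $latest{$id};  # remember first-seen order
    $latest{$id} = $line;                         # later records overwrite older ones
}

# Each store appears once, at its first-seen position, with its newest data.
my @deduped = map { $latest{$_} } @order;
print @deduped;
```

    Note the difference from the reverse-based approaches: this keeps each store at the position of its oldest occurrence, while holding the newest data.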

    Cheers - L~R

Re: how to remove similiar duplicate elements in a file/array
by repellent (Priest) on Jan 08, 2009 at 05:40 UTC
    If you're on a UNIX system, you can use standard coreutils to achieve the job. Number the lines, reverse so the newest copy of each key comes first, drop duplicate keys (characters 9-13 of the record, i.e. characters 9-13 of field 2), then restore the original order and strip the line numbers:
    nl -s '|' with_dups.txt | tac | sort -s -u -t '|' -k 2.9,2.13 | sort -n -b | cut -d '|' -f 2- > no_dups.txt
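    A quick sanity check of that pipeline, using the two sample records from the question plus one unrelated store (file name and records made up for illustration):

```shell
# Build a tiny sample: an old record, an unrelated store, then an
# update to the first store (same key 19779 in characters 9-13).
printf '%s\n' \
  '542642  19779   SAMMYs  17TH ST' \
  '111111  20001   BERTs   OAK AVE' \
  '542642  19779   SAMMYs  Sesame ST' > with_dups.txt

# Number lines, reverse so the newest copy of each key is seen first,
# keep one line per key (chars 9-13 of field 2), restore line order,
# then strip the line numbers.
nl -s '|' with_dups.txt | tac \
  | sort -s -u -t '|' -k 2.9,2.13 \
  | sort -n -b \
  | cut -d '|' -f 2- > no_dups.txt

cat no_dups.txt
# 111111  20001   BERTs   OAK AVE
# 542642  19779   SAMMYs  Sesame ST
```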