Dave_PA has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks!

Here is the scenario of the file that I have to work with. Every week I get a copy of seven files, one for each day of the week. What I need to do is combine the files into one file, keeping them in order (the oldest file first, the newest file last). I got this part working. The next piece is to search for duplicate records, but with a twist.

For lack of a better term, the 'primary key' for a record is in characters 9-13 of the row. So if any information is updated beyond the 13th character, the record needs to be updated. But wait, it gets better. Say an update was made to a record on Monday, then again on Friday. When the files are combined, I need to keep only the newest version, which would be the record inserted on Friday.

An example would be this:

This would be on line 10, so it would be from earlier in the week:

542642  19779   SAMMYs  17TH ST

On line 1500 this would be listed:
542642  19779   SAMMYs  Sesame ST

So what I would like is the SAMMYs at 17TH ST gone, keeping only the listing at Sesame ST. Also, the 19779 is what lets you know that it's the same store.
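In Perl, a fixed-position key like that can be pulled out with substr. A small sketch using the sample records above (note that characters 9-13, counted from 1, are offset 8, length 5 in substr's zero-based terms):

```perl
use strict;
use warnings;

# The two sample records from above; the store id "19779"
# occupies characters 9-13 of each row.
my @records = (
    "542642  19779   SAMMYs  17TH ST",
    "542642  19779   SAMMYs  Sesame ST",
);

for my $record (@records) {
    my $key = substr($record, 8, 5);   # offset 8 == character 9, 1-based
    print "$key\n";                    # prints 19779 for both records
}
```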

So here is where I’m at now. I searched through previous monk posts and found some really good stuff on finding and removing duplicate elements in an array.

http://www.perlmonks.org/?node_id=280484

Which got me to

http://perldoc.perl.org/perlfaq4.html#How-can-I-remove-duplicate-elements-from-a-list-or-array%3F

So here is what I did: I read the file into an array, reversed the array (so that instead of oldest first it was newest first), then did the duplicate search. Then I re-reversed the array, putting the oldest first again.

open(my $fh, '<', 'file.txt') or die "can't open file.txt: $!";
my @FileInfo = <$fh>;
close($fh);

my @newFile = reverse @FileInfo;              # newest first
my %seen = ();
my @unique = grep { !$seen{$_}++ } @newFile;
@newFile = reverse @unique;                   # oldest first again

open($fh, '>', 'file.txt') or die "can't open file.txt for writing: $!";
print $fh @newFile;
close($fh);


My problem is that I can't find a good example that does a hash/grep based on a 'primary key'.

I think I'm close, but I really need help from monks more well-versed than I to make this truly work right.

thanks!
Dave

Replies are listed 'Best First'.
Re: how to remove similiar duplicate elements in a file/array
by ikegami (Patriarch) on Dec 04, 2008 at 17:01 UTC

    You want to check if you've seen the key (not the line), so the key is what you should be putting into the hash. For some definition of extract_key,

    my @unique = grep { my $key = extract_key($_); ! $seen{ $key }++ } @newFile;

    In this case,

    my @unique = grep !$seen{ substr($_, 8, 5) }++, @newFile;  # offset 8 == character 9, 1-based
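    Putting that keyed grep together with the reverse-then-re-reverse idea from the question (the records are the made-up samples from above, plus one unrelated store; the key is assumed to be characters 9-13, i.e. offset 8):

```perl
use strict;
use warnings;

my @FileInfo = (
    "542642  19779   SAMMYs  17TH ST\n",     # older record
    "111111  20001   BERTs   OAK AVE\n",
    "542642  19779   SAMMYs  Sesame ST\n",   # newer record, same key 19779
);

# Newest first, so the first record seen for each key is the one kept.
my @newFile = reverse @FileInfo;
my %seen;
my @unique = grep !$seen{ substr($_, 8, 5) }++, @newFile;

# Back to oldest first.
@newFile = reverse @unique;
print @newFile;   # the BERTs line, then the Sesame ST line
```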
Re: how to remove similiar duplicate elements in a file/array
by Limbic~Region (Chancellor) on Dec 04, 2008 at 20:10 UTC
    Dave_PA,
    Here is how I would do it as a general solution which is a balance of reduced complexity, IO, and memory consumption (untested).
    #!/usr/bin/perl
    use strict;
    use warnings;

    use File::ReadBackwards;

    my @input = qw/sun.txt mon.txt tue.txt wed.txt thu.txt fri.txt sat.txt/;

    my %seen;
    open(my $rev_out_fh, '>', 'output.rev')
        or die "Unable to open 'output.rev' for writing: $!";

    # Walk the files newest-first, and each file back-to-front, so the
    # first record seen for a given id is the newest one.
    for my $file (reverse @input) {
        my $bw = File::ReadBackwards->new($file) or die "can't read '$file' $!";
        while (defined(my $line = $bw->readline)) {
            my $id = substr($line, 8, 5);
            next if $seen{$id}++;
            print $rev_out_fh $line;
        }
    }
    close($rev_out_fh);

    # Reverse the intermediate file to restore oldest-first order.
    open(my $out_fh, '>', 'output.txt')
        or die "Unable to open 'output.txt' for writing: $!";
    my $bw = File::ReadBackwards->new('output.rev')
        or die "can't read 'output.rev' $!";
    while (defined(my $line = $bw->readline)) {
        print $out_fh $line;
    }
    close($out_fh);
    unlink 'output.rev'; # you may care if this fails
    If everything fits into memory then this is probably unnecessary.
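    If it does fit, one hedged in-memory sketch (same made-up sample records and the same offset-8 key as above) is a single forward pass where later records overwrite earlier ones in a hash, with a side array remembering each store's first-seen position:

```perl
use strict;
use warnings;

my @lines = (
    "542642  19779   SAMMYs  17TH ST\n",
    "111111  20001   BERTs   OAK AVE\n",
    "542642  19779   SAMMYs  Sesame ST\n",
);

my (%latest, @order);
for my $line (@lines) {
    my $id = substr($line, 8, 5);
    push @order, $id unless exists $latest{$id};  # remember first-seen order
    $latest{$id} = $line;                         # later records overwrite older ones
}

# Each store appears once, at its first-seen position, with its newest data.
my @deduped = map { $latest{$_} } @order;
print @deduped;
```

    Note the difference from the reverse-based approaches: this keeps each store at the position of its oldest occurrence, while holding the newest data.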

    Cheers - L~R

Re: how to remove similiar duplicate elements in a file/array
by repellent (Priest) on Jan 08, 2009 at 05:40 UTC
    If you're on a UNIX system, you can use standard coreutils to achieve the job. Number the lines, reverse so the newest copy of each key comes first, drop duplicate keys (characters 9-13 of the record, i.e. characters 9-13 of field 2), then restore the original order and strip the line numbers:
    nl -s '|' with_dups.txt | tac | sort -s -u -t '|' -k 2.9,2.13 | sort -n -b | cut -d '|' -f 2- > no_dups.txt
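    A quick sanity check of that pipeline, using the two sample records from the question plus one unrelated store (file name and records made up for illustration):

```shell
# Build a tiny sample: an old record, an unrelated store, then an
# update to the first store (same key 19779 in characters 9-13).
printf '%s\n' \
  '542642  19779   SAMMYs  17TH ST' \
  '111111  20001   BERTs   OAK AVE' \
  '542642  19779   SAMMYs  Sesame ST' > with_dups.txt

# Number lines, reverse so the newest copy of each key is seen first,
# keep one line per key (chars 9-13 of field 2), restore line order,
# then strip the line numbers.
nl -s '|' with_dups.txt | tac \
  | sort -s -u -t '|' -k 2.9,2.13 \
  | sort -n -b \
  | cut -d '|' -f 2- > no_dups.txt

cat no_dups.txt
# 111111  20001   BERTs   OAK AVE
# 542642  19779   SAMMYs  Sesame ST
```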