hehe... I see now where:

cp ($FH_A, $FH_B);

should be:

cp ($file_name_a, $file_name_b);

So here is what I came up with.

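#!/usr/bin/perl
use strict;
use warnings;
use autodie;    # open/close die on failure, so no explicit "or die" is needed

my %seen;

open my $FHIN,  '<', $ARGV[0];
open my $FHNEW, '>', "$ARGV[0].new.csv";
open my $FHDEL, '>', "$ARGV[0].deleted.csv";

# Read line by line rather than slurping the whole file into a list.
while (my $line = <$FHIN>) {
    my ($key, $rest) = split /,/, $line, 2;
    $key =~ s/ [-&_+'] / /gmsx;                  # every punctuation char becomes a space
    $key =~ s/ ( [a-z] ) ( [A-Z] ) /$1 $2/gmsx;  # GroupOne -> Group One

    if ($seen{$key}++) {
        print {$FHDEL} "DUP, $line";
    }
    else {
        print {$FHNEW} "$key,$rest";
    }
}

# close takes a single filehandle, so each one is closed separately.
close $FHIN;
close $FHNEW;
close $FHDEL;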

This works great if the search key is repeated exactly. But what if I have a key that is misspelled, for example:

__DATA__
Group Onne,Captain,Phone Number,League Pos,etc.
Group Oneffdfadsf,Captain,Phone Number,League Pos,etc.
GroupOneeroneouskunk,Captain,Phone Number,League Pos,etc.
Group Two,Captain,Phone Number,League Pos,etc.
Group Three,Captain,Phone Number,League Pos,etc.

Here the first part of the name is correct, but there may be extra junk at the end of it. Is there a way to match only part of the string and, if that part matches, call the line a dup?
Something like:

$seen{$key} =~ m/$key+,/ ? print DUP : print NEW;

The problem is that the incoming data isn't consistent: the CSV has ten columns, but anywhere between 4 and 10 of them are filled in on a given line, so comparing the rest of the data is not a viable way to detect DUP entries.
Stylistically, I've always used all caps for file handles. After all, STDERR and STDOUT are just glorified file handles, and they use all caps. I understand that lexically scoped file handles are not global variables and that this is the distinction you are making, but some habits die hard.
Again, thanks for the assistance.
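
For illustration, here is a minimal sketch of one way the partial match could work. It assumes two keys count as the same group when their first few characters agree after the same clean-up as in the script above. The helper names prefix_key and split_dups and the prefix length of 8 are assumptions made up for this example, not anything from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: normalise a key, then keep only its first $len characters.
sub prefix_key {
    my ($key, $len) = @_;
    $key =~ s/[-&_+']/ /g;               # punctuation to spaces
    $key =~ s/([a-z])([A-Z])/$1 $2/g;    # GroupOne -> Group One
    $key =~ s/\s+/ /g;                   # collapse runs of whitespace
    return lc substr $key, 0, $len;      # case-insensitive prefix
}

# Hypothetical helper: split lines into kept and duplicate lists,
# treating keys with the same normalised prefix as duplicates.
sub split_dups {
    my ($len, @lines) = @_;
    my (%seen, @new, @dup);
    for my $line (@lines) {
        my ($key) = split /,/, $line, 2;
        if ($seen{ prefix_key($key, $len) }++) {
            push @dup, $line;
        }
        else {
            push @new, $line;
        }
    }
    return (\@new, \@dup);
}

my @lines = (
    "Group Onne,Captain,Phone Number,League Pos,etc.\n",
    "Group Oneffdfadsf,Captain,Phone Number,League Pos,etc.\n",
    "GroupOneeroneouskunk,Captain,Phone Number,League Pos,etc.\n",
    "Group Two,Captain,Phone Number,League Pos,etc.\n",
    "Group Three,Captain,Phone Number,League Pos,etc.\n",
);

# Assumption: 8 characters ("group on") is enough to identify a group.
my ($new, $dup) = split_dups(8, @lines);
print "NEW: $_" for @$new;
print "DUP: $_" for @$dup;
```

The first line seen for each prefix is the one that is kept, so with this sample the misspelled "Group Onne" survives and the other two "Group One" variants are flagged. A fixed prefix is crude: too short and distinct groups merge, too long and typos near the start slip through. An edit-distance module from CPAN (Text::Levenshtein, for instance) might be a better fit if the junk is not always at the end.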

In reply to Re: Remove duplicate entries by PyrexKidd
in thread Remove duplicate entries by PyrexKidd
