http://qs1969.pair.com?node_id=718517

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

I have a file from which I need to remove some rows where there is an element of duplication.
Example file (tiny compared to the real thing).
d1 c1.1 f1 d1.1
d1 c1.1 f2 d1.2
d2 c1.1 f1 d1.1
d3 c1.1 f1 d1.1
d4 c1.1 f1 d1.1
d4 c1.1 f2 d1.2
d5 c1.1 f4 d1.4
d6 c1.1 f5 d1.5
For each c1.? group, I want a print out in which all duplicate d1.? entries within that group are removed. In other words, I'm after something like this:
d1 c1.1 f1 d1.1
d1 c1.1 f2 d1.2
d5 c1.1 f4 d1.4
d6 c1.1 f5 d1.5
The print out should include all four columns. Here is what I've attempted so far:
#!/usr/bin/perl -w
use strict;
use warnings;
use English;
use FileHandle;
use Exception;

my ($fIn) = $ARGV[0];
open(FILE, "$fIn") || die "ERROR: Can't open $fIn file: $!\n";

my %hash;
my $c_id;
my $d_id;
my $f_var;

while (<FILE>) {
    chomp;
    my @data = split(/\s+/, $_);
    $c_id  = $data[1];
    $d_id  = $data[3];
    $f_var = $data[2];
    if (!$hash{$c_id}{$f_var}) {
        $hash{$c_id}{$f_var} = $d_id;
    }
}

while (( my $k1, my $k2) = each %hash) {
    print "$k1 ";
    while (( $k2, my $k3) = each %$k2) {
        print "$k2 $k3 ";
    }
    print "\n";
}
But sadly I'm getting an error about not being able to use a string as a HASH ref while 'strict refs' are in use. Could someone please point me in the right direction? Thanks

Replies are listed 'Best First'.
Re: Trying to remove duplicate rows using hashes
by kyle (Abbot) on Oct 21, 2008 at 15:52 UTC

    When you output, you use the value of $k2 as a hash reference, but the second time through the inner loop, its value has been replaced by the key that each returned on the first time through. You need something more like this:

    while ( my ($k1, $href) = each %hash ) {
        while ( my ($k2, $k3) = each %{ $href } ) {
            # print or otherwise use $k1, $k2, $k3 here
        }
    }
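    Filled in with the print statements from the original post, the output section might look something like this (a sketch; %hash is the same structure the OP built):

    while ( my ($k1, $href) = each %hash ) {
        print "$k1 ";
        while ( my ($k2, $k3) = each %{ $href } ) {
            print "$k2 $k3 ";
        }
        print "\n";
    }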

    Incidentally, you use English without the important -no_match_vars option, and you use warnings as well as giving the -w, which is a bit redundant.

    Also, your sample output is in the order that the lines were received, but you won't get your output in that order if you're using each and hashes for storage. There are ways of coping with that, but I can't tell if that's one of your requirements or not.

      Thanks for all your help so far. Much appreciated.

      As regards the print out - you mean I have to use some kind of sort function if I want it to look like my example?

        Yes, you could sort them before you output them, or you can sort them separately in another process. The UNIX sort command is very effective for tabular data such as yours.
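        For instance, a sorted print loop over the same %hash might look something like this (a sketch using the variable names from the original post, with plain string sort order):

        for my $c_id ( sort keys %hash ) {
            print "$c_id ";
            for my $f_var ( sort keys %{ $hash{$c_id} } ) {
                print "$f_var $hash{$c_id}{$f_var} ";
            }
            print "\n";
        }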

Re: Trying to remove duplicate rows using hashes
by jettero (Monsignor) on Oct 21, 2008 at 15:45 UTC

    Offtopic, sorry, I'm sure an actual answer will be along very shortly.

    You don't really use any of these ... Why load them?

    use English;
    use FileHandle;
    use Exception;

    Oh, I have an actual answer I guess. It seems you're stomping on your $k2 with your second each() call...

    -Paul

      oh - well I guess I just cut and pasted those from other scripts - but yes, in this instance I guess they are rather redundant. EDIT: what do you mean by the "stomping"?
        EDIT: what do you mean by the "stomping"?

        I'm sure others have already answered this...

        I mean that your second each call is overwriting your $k2 value, so your next call to each is operating on the value you replaced $k2 with instead of the hashref you mean to have there.

        Stomping is a colloquial expression, by which I mean to say "accidentally replacing" or "stepping on."

        -Paul

Re: Trying to remove duplicate rows using hashes
by ccn (Vicar) on Oct 21, 2008 at 16:06 UTC
    One line:
    ccn@laptop:~$ perl -lane '$H{$F[1].$F[3]}++ or print' file.txt
    d1 c1.1 f1 d1.1
    d1 c1.1 f2 d1.2
    d5 c1.1 f4 d1.4
    d6 c1.1 f5 d1.5
    ccn@laptop:~$
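    Roughly, -n loops over the input, -a autosplits each line into @F, -l handles the newlines, and a line is printed only the first time its second and fourth fields have been seen together. A longhand sketch of the same idea:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Print a line only the first time its second and fourth
    # fields appear together.
    my %seen;
    while ( my $line = <> ) {
        my @F = split ' ', $line;
        print $line unless $seen{ $F[1] . $F[3] }++;
    }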
Re: Trying to remove duplicate rows using hashes
by mje (Curate) on Oct 21, 2008 at 15:57 UTC

    I think you'll find you've at least got an error here:

    if (!$hash{$c_id}{$f_var}) {
        $hash{$c_id}{$f_var} = $d_id;
    }

    as I think a) that test will always be true (I think you want to look at perldoc -f exists) and b) you are only going to get the first c_id/f_var pair, and you didn't say that is what you wanted - you said remove duplicates.

      I think that you are wrong about both of those. Did you test it?

      Yes, it's duplicates I would like removed.
Re: Trying to remove duplicate rows using hashes
by swampyankee (Parson) on Oct 21, 2008 at 16:27 UTC

    On a machine with a decent set of command line tools, you could do something like

    sort -u file > sorted_file_without_duplicates

    where file has your data.

    As kyle pointed out, this suggestion doesn't meet the OP's needs. Sorry, and please disregard.



      The OP is not trying to remove identical lines but rather lines that have two of four fields equivalent. In the example given, the lines removed differ in the first field, so "sort -u" would not remove them.
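      If a command line approach is still wanted, GNU sort can restrict the uniqueness test to particular fields, so something like the following might work (an untested sketch; it sorts on fields 2 and 4 rather than preserving the original line order, and which of the duplicate lines survives is not guaranteed):

      sort -u -k2,2 -k4,4 file > file_without_field_dups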