DamnitAddie has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm working with an Excel sheet that's loaded into an array. I'm trying to remove duplicate rows from the array, based on matches in a specific field within each row: in this example, what I'd refer to as column 3, which is an account number.

Please forgive me, I am a novice at best. I've read a number of posts with some examples, but I'm not following the code because it doesn't seem to look at any specific value. A frequent example of the code suggested is:

@rows = do { my %seen; grep { !$seen{$_}++ } @rows };

I don't fully understand what's happening in that line, in particular what it compares to decide that something is a duplicate. More importantly, it doesn't seem to work on my data.

If anyone has any suggestions, I would like to hear them. I'd also like to understand what the code is doing, rather than just cutting and pasting something. I want to learn, if possible.

Re: Removing duplicates in multi-dimensional arrays
by hippo (Archbishop) on Oct 03, 2018 at 13:18 UTC
    Please forgive me, I am a novice at best.

    No worries.

    it doesn't seem to work on my data.

    The problem here is that you haven't shown what your data is and you haven't shown in what way it "doesn't seem to work". This gives the people who want to help you very little to go on. An SSCCE would be ideal.

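    That said, if you are using an AoA as @rows then that line will not work as is: each element of @rows is an array reference, and every reference stringifies to a different hash key, so no row is ever seen twice. Maybe all you need is: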

    @rows = do { my %seen; grep { !$seen{$_->[2]}++ } @rows };

    But without any data to test on, who can say?

    As for explaining the code: grep keeps only those entries whose key is not yet in a hash, and that hash is built up as you go along. This is the standard, simplest de-duplication technique (without resorting to modules) as outlined in the FAQ "How can I remove duplicate elements from a list or array?"
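
    To make the difference concrete, here is a minimal sketch. The sample rows and the account numbers in them are made up for illustration; only the column position (index 2, i.e. column 3) comes from the question. With the unmodified idiom the hash key is the row reference itself, which stringifies to something like ARRAY(0x...) and differs for every row. Keying on $_->[2] instead compares the account numbers.

```perl
use strict;
use warnings;

# Hypothetical sample data: each row is an array reference,
# and the account number sits at index 2 (column 3).
my @rows = (
    [ 'Alice', 'NY', '1001' ],
    [ 'Bob',   'CA', '1002' ],
    [ 'Carol', 'TX', '1001' ],    # same account number as the first row
);

# Keyed on the row itself: each reference stringifies to a distinct
# address, so no row is ever treated as a duplicate.
my @by_ref = do { my %seen; grep { !$seen{$_}++ } @rows };

# Keyed on the account number: only the first row for each
# account number is kept.
my @by_acct = do { my %seen; grep { !$seen{ $_->[2] }++ } @rows };

printf "keyed on row reference:  %d rows\n", scalar @by_ref;
printf "keyed on account number: %d rows\n", scalar @by_acct;
print join( ',', @$_ ), "\n" for @by_acct;
```

    Note that the first row seen for each account number wins; if a later row should take precedence, the input would need to be reordered (for example with reverse) before filtering.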

      That actually worked. I apologize for not being more verbose with the data.

      In the examples that I played with, I was trying $_2 vs $_->2 and getting all blanks.

      I appreciate your reply.

        Your post is a little bit difficult to read because "$_2 vs $_->2" is rendered a bit confusingly without <code> tags. Please see Writeup Formatting Tips and Markup in the Monastery. You've already employed code-tags to good effect in your OP; you only need to be more thorough in their use.

        As an aside: Do you understand the difference between $_[2] and $_->[2]?

        And please do read and follow the principles discussed in the SSCCE article.
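
        As a hint for the aside above, here is a small sketch of the two forms side by side. The sub name third_argument and the sample data are invented for this illustration. $_[2] is the element at index 2 of the array @_ (the argument list of a sub) and has nothing to do with the scalar $_, whereas $_->[2] treats $_ as an array reference and fetches element 2 of the array it points to.

```perl
use strict;
use warnings;

# Hypothetical helper, named only for this illustration.
sub third_argument {
    # Inside a sub, @_ holds the arguments; $_[2] is the third one.
    return $_[2];
}

my @rows = ( [ 'a', 'b', 'c' ], [ 'd', 'e', 'f' ] );

for (@rows) {
    # Here $_ is a reference to one row; $_->[2] is that row's third field.
    print "third field: $_->[2]\n";
}

print "third argument: ", third_argument( 'x', 'y', 'z' ), "\n";
```

        Inside a grep block over an AoA, $_[2] therefore looks at @_ rather than at the current row, which is one plausible reason for seeing blanks.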


        Give a man a fish:  <%-{-{-{-<

Re: Removing duplicates in multi-dimensional arrays
by atcroft (Abbot) on Oct 03, 2018 at 14:39 UTC

    I'm not sure that particular snippet is going to be directly usable for your purpose (although others may suggest ways to adapt it to do so), but in the interest of education I will try to explain what is going on. For the purpose of this example, assume @rows has the following initial content: [ 'a', 's', 'd', 'd', 'f', ]. (And if anyone notices any errors in the following, please point them out, so I do not lead someone else astray!)

    According to the docs for do:

    Not really a function. Returns the value of the last command in the sequence of commands indicated by BLOCK. When modified by the while or until loop modifier, executes the BLOCK once before testing the loop condition. (On other statements the loop modifiers test the conditional first.)

    And the docs for grep:

    This is similar in spirit to, but not the same as, grep(1) and its relatives. In particular, it is not limited to using regular expressions.

    Evaluates the BLOCK or EXPR for each element of LIST (locally setting $_ to each element) and returns the list value consisting of those elements for which the expression evaluated to true. In scalar context, returns the number of times the expression was true.

    my @foo = grep(!/^#/, @bar);    # weed out comments
    or equivalently,
    my @foo = grep {!/^#/} @bar;    # weed out comments
    Note that $_ is an alias to the list value, so it can be used to modify the elements of the LIST. While this is useful and supported, it can cause bizarre results if the elements of LIST are not variables. Similarly, grep returns aliases into the original list, much as a for loop's index variable aliases the list elements. That is, modifying an element of a list returned by grep (for example, in a foreach , map or another grep) actually modifies the element in the original list. This is usually something to be avoided when writing clear code.

    So what does this actually mean? Let's walk through it.

    1. do executes its block
      1. The hash %seen is declared as a local (lexical) variable.
      2. grep evaluates for $_ = 'a'. As there is no entry for 'a', !$seen{'a'} is !0 which is 1 (true), and 'a' will pass the grep test, but the '++' increments $seen{'a'} to 1.
      3. grep evaluates for $_ = 's'. As there is no entry for 's', !$seen{'s'} is !0 which is 1 (true), and 's' will pass the grep test, but the '++' increments $seen{'s'} to 1.
      4. grep evaluates for $_ = 'd'. As there is no entry for 'd', !$seen{'d'} is !0 which is 1 (true), and 'd' will pass the grep test, but the '++' increments $seen{'d'} to 1.
      5. grep evaluates for $_ = 'd'. $seen{'d'} is 1, so !$seen{'d'} is !1 which is 0 (false), and this instance of 'd' fails the grep test, but the '++' increments $seen{'d'} to 2.
      6. grep evaluates for $_ = 'f'. As there is no entry for 'f', !$seen{'f'} is !0 which is 1 (true), and 'f' will pass the grep test, but the '++' increments $seen{'f'} to 1.
    2. @rows is assigned the results of the do (the results of the grep on @rows); that is, [ 'a', 's', 'd', 'f', ]
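
    The steps above can be reproduced with a small tracing sketch. It replaces grep with an explicit loop purely so that each step can be printed; the variable names $count_before and $keep are made up for this illustration. The key detail is that ++ here is a post-increment, so the ! test sees the value from before the increment.

```perl
use strict;
use warnings;

my @rows = ( 'a', 's', 'd', 'd', 'f' );

my %seen;
my @unique;
for my $item (@rows) {
    my $count_before = $seen{$item} // 0;    # undef (never seen) counts as 0
    my $keep = !$seen{$item}++;              # test the old value, then increment
    printf "%s: seen before=%d, keep=%s, seen after=%d\n",
        $item, $count_before, ( $keep ? 'yes' : 'no' ), $seen{$item};
    push @unique, $item if $keep;
}

print "result: @unique\n";
```

    The second 'd' is the only item reported with keep=no, which matches step 5 of the walkthrough.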

    Hope that helps.

Re: Removing duplicates in multi-dimensional arrays
by BillKSmith (Monsignor) on Oct 03, 2018 at 20:32 UTC
    The code which you show is a common idiom which hides a lot of details. Here is an expanded version which should be easier to follow.
    $type Damnitaddie.pl
    use strict;
    use warnings;
    use Data::Dumper;

    my @rows = (
        [qw(a b c d)],
        [qw(e f c h)],
        [qw(i j k l)],
    );
    my %seen;
    @rows = grep {
        my $x = $_->[2];
        if ( exists $seen{$x} ) {
            $seen{$x}++;
            0;
        }
        else {
            $seen{$x} = 0;
            $seen{$x}++;
            1;
        }
    } @rows;
    #@rows = do { my %seen; grep { !$seen{$_->[2]}++ } @rows };
    print Dumper( \@rows );

    $perl Damnitaddie.pl
    $VAR1 = [
              [
                'a',
                'b',
                'c',
                'd'
              ],
              [
                'i',
                'j',
                'k',
                'l'
              ]
            ];
    Bill