ashish.kvarma has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I was looking at some code (to get a unique AOA) which I thought was not very efficient. A simplified version of this code is given below.
(Please note that only the combination of indexes 1 and 3 needs to be unique, and the result should stay in the same order.)

use strict;
use warnings;
use Data::Dumper;

# Only the combination of index 1 and 3 needs to be unique.
my $AOA = [
    [
        1,
        219.00,    # Need to check this element for unique
        'ABC',
        'Abcdefghijklmnopqrstuvwxyz',    # Need to check this element for unique
        'def',
        '',
    ],
    [
        2,
        219.00,
        'ABC',
        'Abcdefghijklmnopqrstuvwxyz',
        'un',
        'we',
    ],
    [
        3,
        209.20,
        'ABC',
        'AbcdefghijklmnopqrstuvwxyzAbcdefghijk',
        'udf',
        '',
    ],
    [
        4,
        209.20,
        'ABC',
        'Abcdefghijklmnopqrstuvwxyz',
        'und',
        '',
    ],
];

my $max = $#$AOA;
my @matched_indexes;
my $count = 0;

for (my $j = 0; $j <= $max; $j++) {
    for (my $i = $j + 1; $i <= $max; $i++) {
        if (($AOA->[$i][1] == $AOA->[$j][1]) && ($AOA->[$i][3] eq $AOA->[$j][3])) {
            $matched_indexes[$count] = $i;
            $count++;
        }
    }
}

foreach my $index (@matched_indexes) {
    $AOA->[$index][1] = undef;
    $AOA->[$index][3] = undef;
}

my @new_AOA;

# Deleting the array elements if undef
for (my $index = 0; $index <= $max; $index++) {
    if ((defined $AOA->[$index][1]) && (defined $AOA->[$index][3])) {
        push(@new_AOA, $AOA->[$index]);
    }
}

# Duplicate records removed
print Dumper \@new_AOA;

# Expected Output
#$VAR1 = [
#          [
#            1,
#            219,
#            'ABC',
#            'Abcdefghijklmnopqrstuvwxyz',
#            'def',
#            ''
#          ],
#          [
#            3,
#            '209.2',
#            'ABC',
#            'AbcdefghijklmnopqrstuvwxyzAbcdefghijk',
#            'udf',
#            ''
#          ],
#          [
#            4,
#            '209.2',
#            'ABC',
#            'Abcdefghijklmnopqrstuvwxyz',
#            'und',
#            ''
#          ]
#        ];

I tried a few solutions to make it more efficient; the best I could come up with was:

my %HOA;
my $count = 0;

foreach my $A (@$AOA) {
    my $key = $A->[1] . '-' . $A->[3];
    if (!defined $HOA{$key}) {
        $HOA{$key} = [ $A, $count++ ];
    }
}

# Records need to be in the same sequence
my @new_AOA = sort { $a->[1] <=> $b->[1] } values %HOA;
@new_AOA = map { $_->[0] } @new_AOA;
print Dumper \@new_AOA;

The above solution gets rid of the nested loop, though it adds new loops in the form of sort and map (which I need in order to keep the records in the same sequence as in the original AOA).
The other thing I noticed is the length of the hash keys, since index 3 can hold a very large string. I am not sure whether large keys could be an issue.
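One idea for keeping the keys short would be to digest the long string first, e.g. with the core Digest::MD5 module. This is only a sketch of the idea, untested on real data (and in theory two different strings could produce the same digest, though that is astronomically unlikely):

use Digest::MD5 qw(md5);

my %HOA;
my $count = 0;
foreach my $A (@$AOA) {
    # md5() returns a fixed 16-byte digest, so the key length no
    # longer depends on the (possibly very long) string at index 3
    my $key = $A->[1] . '-' . md5($A->[3]);
    if (!defined $HOA{$key}) {
        $HOA{$key} = [ $A, $count++ ];
    }
}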

Please suggest any improvements to the above code.
Thanks in advance.

Update: Fixed some typos.

Regards,
Ashish

Replies are listed 'Best First'.
Re: Optimize code | remove duplicates from AOA.
by moritz (Cardinal) on Aug 19, 2010 at 09:07 UTC

    Step back.

    Ask yourself if your code is actually too slow for your purpose. Only if the answer is "yes", continue.

    Before you optimize, benchmark your code. After you apply an optimization, benchmark again. Did it get faster? If not, revert it.

    After each optimization step, ask yourself if the code is now fast enough for your purposes. If yes, stop.

    Use a profiler to find the slow parts of your program.
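
    For example, with Devel::NYTProf from CPAN (assuming it is installed; yourscript.pl stands in for your own program):

    perl -d:NYTProf yourscript.pl   # writes profile data to nytprof.out
    nytprofhtml                     # turns nytprof.out into an HTML report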


    After this section on general optimization technique, I'll now propose a hash-based, one-pass solution. You tell me if it's faster or not. (I can't know, because I don't have real test data; four arrays aren't enough to do any serious performance analysis.)

    my %uniq;
    my @new_AOA;
    foreach my $A (@$AOA) {
        my $key = $A->[1] . '-' . $A->[3];
        push @new_AOA, $A unless $uniq{$key}++;
    }
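
    A side note on building the key: if the value at index 1 could itself contain a '-', two different pairs might produce the same key. Joining on Perl's subscript separator $; (by default "\034", a character unlikely to occur in real data) avoids that:

    my $key = join $;, $A->[1], $A->[3];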

    Using grep instead of the loop might or might not be faster.

    Update: grep version:

    my %uniq;
    my @new_AOA = grep { !( $uniq{$A->[1] . '-' . $A->[3]}++ ) } @$AOA;

    (Untested, but you get the idea).

    And finally, it's a good idea to name variables by what they contain, not by the structure of what they contain (e.g. %seen_keys rather than %HOA, or @unique_records rather than @new_AOA).

    Perl 6 - links to (nearly) everything that is Perl 6.

      Not entirely sure, but (just for completeness) is your grep version supposed to sub the $A with $_ to eliminate the for loop and replace it with the grep?

      So the final, quickest version:

      my %uniq;
      my @new_AOA = grep { !( $uniq{$_->[1] . '-' . $_->[3]}++ ) } @$AOA;
Re: Optimize code | remove duplicates from AOA.
by jethro (Monsignor) on Aug 19, 2010 at 09:09 UTC

    You are nearly there with the hash idea, but missed a simplification. Just step through your AoA in one loop and store the keys in a separate hash. If you find a duplicate, delete the array entry (or whatever else you want to do with duplicates):

    my %found;
    my $i = 0;
    while ($i < @$AOA) {
        my $key = $AOA->[$i][1] . '-' . $AOA->[$i][3];
        if (exists $found{$key}) {
            splice(@$AOA, $i, 1);    # remove the duplicate, keep $i in place
        }
        else {
            $found{$key} = 1;
            $i++;
        }
    }

    (Untested code.) If you need to do some processing of the deleted item, change the splice line to my $deleteditem = splice(@$AOA, $i, 1);

    UPDATE: Thanks go to moritz for detecting that I forgot to increment $i. Corrected.

      I doubted that repeated splice on a long array is efficient (because arrays are stored as contiguous memory in Perl, and splice requires copying the remainder of the array), so I wrote a small benchmark:
      use 5.010;
      my @a = map int(10 * rand), 1 .. 1000;

      sub d_grep {
          my @copy = @a;
          @copy = grep { $_ % 2 == 0 } @copy;
      }

      sub d_splice {
          my @copy = @a;
          my $i = 0;
          while ($i < @copy) {
              if ($copy[$i] % 2 == 1) {
                  splice @copy, $i, 1;
              }
              else {
                  $i++;
              }
          }
      }

      use Benchmark qw(cmpthese);
      cmpthese -2, {
          grep   => \&d_grep,
          splice => \&d_splice,
      };
      __END__
               Rate splice   grep
      splice 2003/s     --   -49%
      grep   3900/s    95%     --

      I intentionally used in-place edits in both cases, so that the two implementations remain comparable. If you find some flaws in the benchmark, please let me know - writing benchmarks isn't my particular strength :-)

      For longer arrays the difference in speed grows: every splice has to shift all remaining elements, so deleting many items this way tends toward O(n²), while grep makes a single pass.
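
      Applied back to the original problem, the in-place deletion with grep would look something like this (untested sketch; %found is just an illustrative name):

      my %found;
      @$AOA = grep { !$found{ $_->[1] . '-' . $_->[3] }++ } @$AOA;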

      Perl 6 - links to (nearly) everything that is Perl 6.

        May I answer this with your own words ;-) :

        Ask yourself if your code is actually too slow for your purpose. Only if the answer is "yes", continue.

        Perhaps I should mention that I interpreted his 'efficiency' more in terms of efficient coding than of speed (which could be the wrong interpretation).