flounder has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I know what I need to do, but just am not quite sure how to do it. The below code works, but can become very slow for large sets of data. Basically, the problem is, I can have multiple sets of data, and want to compare between all sets, but not within the set itself. An example will lend a hand in my poor explanation:
@array = ([1,2,3],[5,5,6],[1,4,6]);
In this case, I would like to pick out 1 and 6 as duplicates through the sets of data. However, 5 is not a duplicate, because I don't care about the duplicates within a set. I think the fastest way to do this is probably using the exists function with hashes, but I'm not quite sure how to do it effeciently. My sad attempt at doing this(coming from a C background) is shown below. Like I said, it works, but it's very time consuming due to the nested loops. Thanks in advance to all help!
my $loop_index = 0; for $comma_array (@current_case_commas) { for my $comma (@$comma_array) { if($comma != "") { for (my $i=0; $i < $loop_index; $i++) { my @other_cases_array = @{$current_case_commas[$i]}; for my $other_cases (@other_cases_array) { if($other_cases = "") { if ($comma == $other_cases) { $final_case_values_duplicate{$comma} = +(); } } } } for (my $i=$loop_index+1; $i <= $#current_case_commas; +$i++) { my @other_cases_array = @{$current_case_commas[$i]}; for my $other_cases (@other_cases_array) { if($other_cases != "") { if ($comma == $other_cases) { $final_case_values_duplicate{$comma} = +(); } } } } } } $loop_index+=1; }
Oh, and one more thing now that I see the code. The lines where I have to check for a null or space in the array. I am not quite sure how they got there, and would like to get rid of them all together. The way I enter things into the array is:
push @array, split(/[\s|\t]+/,$list_of_num);
Where the list is a space separated list of numbers. I am not sure why(since I'm splitting on spaces/tabs) a blank would get into the array in the first place. Thanks again for putting up with another newbie, and all your help.

Replies are listed 'Best First'.
Re: Comparing between multiple sets of data
by Masem (Monsignor) on Jun 22, 2001 at 21:41 UTC
    A better way to do it is collect what you need first, then do the comparison, that is:
    • Make a hash
    • For each element in each list, add the list # to the hash for that element if not already on there
    • For each key of the hash, if the list has more than 1 item, it's duplicated.
    Programmically, it would look like:
    my %hash; foreach my $i ( 0..$#array ) { foreach my $j ( @{ $array[ $i ] } ) { $hash{ $j } ||= []; #need to set this if not made if ( ! grep { $i == $_ } @{ $hash{ $j } } ) { push @{ $hash{ $j } }, $i; } } } my @matches = grep { @{ $hash{ $_ } } > 1 } keys %hash;
    This is probably close to being on the order of N^2 (N being the total number of elements in the entire 2d structure), but will be closer to N as the number of possible duplicates drop.

    Note that if you wish to exclude elements, you can simply add an "if (something) next" line in the deepest block, or do an additional grep on the @{ $array } in the second foreach block to remove elements you don't care about. But it's probably faster to look at everything, then drop what you don't need then checking every time.


    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
Re: Comparing between multiple sets of data
by mirod (Canon) on Jun 22, 2001 at 21:47 UTC

    This seems to work:

    #!/bin/perl -w use strict; my @array = ([1,2,3],[5,5,6],[1,4,6]); # nb => number of arrays in which the number is found my %in_array; foreach my $array (@array) { # the keys of %unique_for_array are the (unique) numbers in the ar +ray my %unique_for_array= map { $_, 1} @$array; # now use those unique numbers to count $in_array{$_}++ foreach keys %unique_for_array; } # get the duplicates my @duplicates= grep { $in_array{$_} > 1 } keys %in_array; print "duplicates: ", join( ', ', @duplicates), "\n";
Re: Comparing between multiple sets of data
by maverick (Curate) on Jun 22, 2001 at 22:57 UTC
    Granted that others have already posted on this, but here's my take on it. :)

    First, I'd do the split as "split(/\s+/,$list_of_num)". \s is 'whitespace' which includes ' ', \t (tab), \f (form feed), \r (carrage return) and, \n (newline).

    If you don't need the original inner lists in the form they were read in, why not make them a hash first? this would save you a traversal of the list to make a hash of it for the uniqe-ing part.

    use strict; use Data::Dumper; my @input = ("1 2 3", "5 5 6", "1 4 6" ); # outer map: take each input line # inner map: split it on white space and return it as a hash ref my @array = map { { map { $_ => 1 } split(/\s+/,$_) } } @input; print Dumper @array; my %unique; for (my $i=0; $i<=$#array; $i++) { foreach (keys %{$array[$i]}) { $unique{$_}->{'pos'}->{$i} = 1; $unique{$_}->{'count'}++; } } my @duplicates = grep { $unique{$_}->{'count'} > 1 } keys %unique; foreach (@duplicates) { print "DUP: $_ In lists: "; print join(",",keys %{$unique{$_}->{'pos'}}); print "\n"; }
    This is essentially the same idea as others have posted, execpt this keeps track of which sets each duplicate was found in.

    /\/\averick