
<nitpick> Actually, rather than forgetting to specify that the arrays are expected to contain only integers in the single-decimal-digit range (>0, <10), what you said was:

I now have two identical arrays, even if the sort is ascii-betical instead of true numerical.

which sort of implies that you would expect to see some integers >9 (still assuming we are only talking about positive integers).

BTW, how big are these arrays, and should we expect repeated values? If there are repeated values, when you say "test if @array_1 contains exactly the same set of integers as @array_2" do you mean "the same quantities of elements for each observed value", or simply "the same values present, any number of times"? </nitpick>
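To make those two readings concrete, here is a minimal sketch. The sub names `same_counts` and `same_values` and the sample data are made up for illustration; they are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every value occurs the same number of
# times in both arrays ("the same quantities for each observed value").
sub same_counts {
    my ( $x, $y ) = @_;
    my %h;
    $h{$_}++ for @$x;
    $h{$_}-- for @$y;
    return !grep { $_ != 0 } values %h;
}

# Hypothetical helper: true if the same distinct values are present,
# no matter how many times each one occurs.
sub same_values {
    my ( $x, $y ) = @_;
    my ( %in_x, %in_y );
    @in_x{@$x} = ();
    @in_y{@$y} = ();
    return 0 if keys(%in_x) != keys(%in_y);
    return !grep { !exists $in_y{$_} } keys %in_x;
}

my @array_1 = ( 1, 2, 2, 3 );
my @array_2 = ( 3, 2, 1, 1 );

print "same counts: ", ( same_counts( \@array_1, \@array_2 ) ? "yes" : "no" ), "\n";
print "same values: ", ( same_values( \@array_1, \@array_2 ) ? "yes" : "no" ), "\n";
```

With this sample data the two readings disagree: the counts differ (two 2s vs. two 1s), but the distinct values present are the same.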

You've heard about the Benchmark module, right? Have you tried that with "join" vs. something else to conclude that "join" is "heavy"?

As always with this sort of problem, hashes come to mind, but you would need Benchmark to see how it compares to sorting and stringifying.

Here's a test of hash vs. sort-join-string-compare vs. sort-iterate-numeric-compare, checking results for both "same" and "diff" data sets (requiring that repeated values appear with the same quantity in order to be "same"), with options to change array size and max value in the array:

#!/usr/bin/perl

use strict;
use Benchmark qw/cmpthese/;
use List::Util qw/shuffle/;

# ARGV[0], if present, sets array size
# ARGV[1], if present, sets max value of integer range

my $nelem = 9;
my $upper = 9;
$nelem = shift if ( @ARGV and $ARGV[0] =~ /^\d+$/ );
$upper = shift if ( @ARGV and $ARGV[0] =~ /^\d+$/ );
die "Usage:  $0  array_size  max_value\n (min_value is always 1)\n"
    if ( @ARGV );

my ( @array1, @array2 );
push @array1, int(rand($upper))+1 for ( 1..$nelem );
push @array2, int(rand($upper))+1 for ( 1..$nelem );

my %answer;
cmpthese( 100_000,
          {
           'hash' => \&comp_by_hash,
           'sortcomp' => \&comp_by_sortcomp,
           'sortjoin' => \&comp_by_sortjoin,
          } );

print "\nresults: ",join(" ",map{"$_:$answer{$_}"} keys %answer),"\n\n";

@array2 = shuffle( @array1 );
cmpthese( 100_000,
          {
           'hash' => \&comp_by_hash,
           'sortcomp' => \&comp_by_sortcomp,
           'sortjoin' => \&comp_by_sortjoin,
          } );

print "\nresults: ",join(" ",map{"$_:$answer{$_}"} keys %answer),"\n\n";

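# Count each value up for @array1 and down for @array2; the arrays hold
# the same values in the same quantities only if every count ends at zero.
sub comp_by_hash {
    my %h;
    $h{$_}++ for ( @array1 );
    $h{$_}-- for ( @array2 );
    $answer{hash} = ( grep {$h{$_}!=0} keys %h ) ? 'diff' : 'same';
}
# Sort both arrays, then compare element by element, stopping at the
# first mismatch. The default (string) sort is fine here because both
# arrays are sorted the same way. Assumes both arrays have the same size.
sub comp_by_sortcomp {
    my @sort1 = sort @array1;
    my @sort2 = sort @array2;
    for ( 0..$#sort1 ) {
        if( $sort1[$_] != $sort2[$_] ) {
            $answer{sortcomp} = 'diff';
            return;
        }
    }
    $answer{sortcomp} = 'same';
}
# Sort both arrays, join each into one string, and compare the strings.
sub comp_by_sortjoin {
    my $j1 = join ' ', sort( @array1 );
    my $j2 = join ' ', sort( @array2 );
    $answer{sortjoin} = ( $j1 eq $j2 ) ? 'same':'diff';
}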
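
The test script always builds both arrays with the same number of elements, so comp_by_sortcomp never has to check sizes. If the arrays could differ in length, the element-by-element loop would need a guard. The following is only a sketch under that assumption; the sub name `arrays_match` is hypothetical, and it uses an explicit numeric sort rather than the default one.

```perl
use strict;
use warnings;

# Hypothetical stand-alone variant of the sort-and-iterate comparison.
# Takes two array references; returns 1 if they hold the same integers
# in the same quantities, 0 otherwise.
sub arrays_match {
    my ( $x, $y ) = @_;
    return 0 if @$x != @$y;                 # different sizes can never match
    my @sort1 = sort { $a <=> $b } @$x;     # numeric, not ascii-betical
    my @sort2 = sort { $a <=> $b } @$y;
    for my $i ( 0 .. $#sort1 ) {
        return 0 if $sort1[$i] != $sort2[$i];
    }
    return 1;
}

print arrays_match( [ 3, 10, 2 ], [ 2, 3, 10 ] )     ? "same" : "diff", "\n";
print arrays_match( [ 3, 10, 2 ], [ 2, 3, 10, 10 ] ) ? "same" : "diff", "\n";
```

The size check is also a cheap early exit: when the lengths differ, nothing gets sorted at all.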

And here are the timing results (and output) for a "default" run (array size: 9, max value: 9, running on a 1GHz G4 PowerPC, Mac OS X 10.4.10):

$ test.pl
            Rate sortcomp     hash sortjoin
sortcomp 20202/s       --      -0%     -58%
hash     20243/s       0%       --     -58%
sortjoin 48077/s     138%     137%       --

results: hash:diff sortjoin:diff sortcomp:diff

            Rate sortcomp     hash sortjoin
sortcomp 17857/s       --     -26%     -63%
hash     24213/s      36%       --     -50%
sortjoin 48544/s     172%     100%       --

results: hash:same sortjoin:same sortcomp:same

As you would expect, if we grow the array size but keep to the same limited number of possible array values, the "sortjoin" method will suffer more than the hash method (due to having to build and compare longer strings):

$ test.pl 90  ## array size = 90 (lots of repeat values)
           Rate sortcomp sortjoin     hash
sortcomp 2199/s       --     -53%     -61%
sortjoin 4645/s     111%       --     -18%
hash     5653/s     157%      22%       --

results: hash:diff sortjoin:diff sortcomp:diff

           Rate sortcomp sortjoin     hash
sortcomp 1907/s       --     -59%     -66%
sortjoin 4684/s     146%       --     -17%
hash     5659/s     197%      21%       --

results: hash:same sortjoin:same sortcomp:same

But overall, I think this is an area where benchmarking, while fun and interesting, is really a bit pointless. Unless this particular task occupies a huge proportion of what your application is supposed to do, the difference in "efficiency" among these approaches is likely to be drowned out by everything else the app actually does (file/db/network I/O, etc.).

In reply to Re^3: Compare two arrays of simple numbers by graff
in thread Compare two arrays of simple numbers by punch_card_don
