in reply to Re^2: better union of sets algorithm?
in thread better union of sets algorithm?

You can do slightly better than what you just suggested for a mergesort of k lists (when k is not super-small).
Yes, but it's much harder to code off the top of my head. :)
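For the curious, the "slightly better" approach is presumably the classic O(n log k) k-way merge: keep one head element per list in a min-heap, repeatedly take the smallest head, and refill from the same list. Here is a rough sketch of that idea (the sub name and structure are mine, not from any post in this thread), assuming numeric, already-sorted input lists:

#!/usr/bin/perl
use strict;
use warnings;

# O(n log k) k-way merge of sorted lists, via a hand-rolled min-heap
# keyed on each list's current head element.
sub merge_k {
    my @heap = grep { @$_ } map { [@$_] } @_;   # copies of the non-empty lists

    # Sift the list at index $i down until the heap property holds.
    my $sift = sub {
        my $i = shift;
        while ((my $l = 2 * $i + 1) < @heap) {
            my $r = $l + 1;
            my $m = $i;
            $m = $l if $heap[$l][0] < $heap[$m][0];
            $m = $r if $r < @heap && $heap[$r][0] < $heap[$m][0];
            last if $m == $i;
            @heap[ $i, $m ] = @heap[ $m, $i ];
            $i = $m;
        }
    };
    $sift->($_) for reverse 0 .. int(@heap / 2) - 1;    # heapify

    my @out;
    while (@heap) {
        my $top = $heap[0];
        push @out, shift @$top;          # smallest remaining element overall
        if (@$top) {
            $sift->(0);                  # that list still has elements
        }
        else {
            my $last = pop @heap;        # list exhausted: shrink the heap
            if (@heap) {
                $heap[0] = $last;
                $sift->(0);
            }
        }
    }
    return @out;
}

print join(' ', merge_k([1, 4, 9], [2, 3, 10], [5, 6, 7, 8])), "\n";
# prints: 1 2 3 4 5 6 7 8 9 10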

What I fail to understand is why you say that perrin's hash algorithm takes O(n log n) time. No, a hash dereferencing operation takes amortized constant time, and you are just doing a dereference for each element in each list. That's a linear time algorithm.
My understanding has always been that hash operations over n items take lg n time, since in general a hash will divide up n items into n / lg n buckets, then do a linear search through the appropriate bucket.

Some experiments seem to bear this out: as I increase the number of elements in a hash, the number of operations per second I can do decreases. This wouldn't happen if hash operations took constant time.

#!/usr/bin/perl
use warnings;
use strict;
use Time::HiRes qw(time);

foreach my $size (100_000, 200_000, 400_000, 800_000, 1_600_000, 3_200_000) {
    my %h;
    my ($start, $end);

    # Test insert times
    $start = time;
    foreach my $i (1 .. $size) {
        $h{$i} = $i;
    }
    $end = time;
    my $insert_time = $end - $start;

    # Now test lookup times
    $start = time;
    foreach my $i (1 .. $size) {
        $h{$i} == $i or die "what the hell?";
    }
    $end = time;
    my $lookup_time = $end - $start;

    printf "%10d: %5.3f inserts/sec, %5.3f lookups/sec\n",
        $size, $size / $insert_time, $size / $lookup_time;
}
    100000: 446368.754 inserts/sec, 1050330.051 lookups/sec
    200000: 424977.633 inserts/sec, 821176.759 lookups/sec
    400000: 398651.394 inserts/sec, 805369.896 lookups/sec
    800000: 399463.754 inserts/sec, 787853.768 lookups/sec
   1600000: 390381.775 inserts/sec, 766870.015 lookups/sec
   3200000: 377860.715 inserts/sec, 720909.739 lookups/sec

Then again, my recollections of algorithmic amortizations are hazy at best, so feel free to convince me that I'm wrong. :)

Re^4: better union of sets algorithm?
by demerphq (Chancellor) on Mar 11, 2005 at 08:25 UTC

    IMO what you are seeing there is NOT lookup time, but rather effects such as cache misses and hash remapping. Every time the density of the hash goes above a certain level, the number of buckets in the hash is doubled and all of the elements are remapped. This has a cost that shows up in the type of benchmark you are doing, and you would see most of the effect disappear by presizing the hash before loading it. The slowdown caused by cache misses, on the other hand, is basically impossible to eliminate: once the hash gets over a certain size you are more or less guaranteed that a hash lookup will spoil the cache, losing the benefit you would have if the whole hash fit inside the cache.
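    To illustrate the presizing point, here is a minimal sketch (not from the original post): assigning to keys() is the documented way to ask Perl to allocate the buckets up front, so the insert loop never has to double-and-remap the table mid-benchmark.

    my %h;
    keys(%h) = 3_200_000;               # pre-allocate buckets (rounded up to a power of two)
    $h{$_} = $_ for 1 .. 3_200_000;     # inserts no longer trigger incremental rehashing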

    Also, it's worth remembering that every time you use a unique key in _any_ hash _anywhere_, that key will be inserted into a second hash structure that is used to store the keys, which means the second hash must undergo scaling as well. AFAIK, the amortized cost of hash storage or lookup is actually constant. BTW, I believe later perls incorporate a mechanism by which degenerate bucket mappings are avoided; from what I understand it's unusual to get more than two elements in a bucket, and more commonly you won't see more than one.

    ---
    demerphq

      Also its worth remembering that every time you use a unique key in _any_ hash _anywhere_ that key will be inserted into a second hash structure that is used to store the keys.

      No, buckets are linked lists, not secondary hashes.

        A default build of perl shares keys across all hashes. That way, when you have a zillion hashes with the same keys, the keys aren't duplicated in memory. To store all those keys it uses another hash that really holds the keys. So when you insert a new key into your hash, its hash value is calculated and then that value is used to store into both hashes, which means that the shared-key hash has to grow as well.
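        If you want a glimpse of that for yourself, one rough illustration (not part of the post above) is Devel::Peek, which reports the SHAREKEYS flag on an ordinary hash and the precomputed hash value stored alongside each key:

        use Devel::Peek;
        my %h = (foo => 1, bar => 2);
        Dump(\%h);    # look for "FLAGS = (SHAREKEYS)" and a "HASH = 0x..." line per element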

        ---
        demerphq

Re^4: better union of sets algorithm?
by blokhead (Monsignor) on Mar 11, 2005 at 07:22 UTC
    Yes, this looks very much like a logarithmic increase, doesn't it? And it jibes with what you say about using n/log(n) buckets. If there are n items in the hash and m buckets, then the amortized cost of any hash operation is O(1 + n/m). If indeed Perl is using n/log(n) buckets, this makes all hash operations logarithmic.

    I'm pretty sure this implies that if you scale the number of buckets in proportion to n (instead of n/log(n)), you will get constant-time operations. But perhaps something yucky happens when you try to implement such a creature. Or perhaps I am confusing hashes with other amortized structures (where I know for a fact that certain operations are constant-time). In either case, the empirical evidence suggests I should retract my assertion about constant amortized hashes in Perl. Is there an internals geek around who can shed light on the actual implementation used?
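    For what it's worth, my understanding of the internals is that perl doubles the bucket array whenever the number of keys outgrows it, so the bucket count tracks n rather than n/log(n). A rough way to watch that happen is the sketch below; it relies on the old behaviour where a hash in scalar context reports "used/total" buckets, so treat it as illustrative only.

    my %h;
    for my $exp (10 .. 20) {
        my $n = 2**$exp;
        $h{$_} = 1 for (scalar(keys %h) + 1) .. $n;
        printf "%8d keys -> buckets used/allocated: %s\n", $n, scalar(%h);
    }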

    blokhead

Re^4: better union of sets algorithm?
by blazar (Canon) on Mar 11, 2005 at 09:09 UTC
    You can do slightly better than what you just suggested for a mergesort of k lists (when k is not super-small).
    Yes, but it's much harder to code off the top of my head. :)
    From perldoc sort:
               use sort 'stable';          # guarantee stability
               use sort '_quicksort';      # use a quicksort algorithm
               use sort '_mergesort';      # use a mergesort algorithm
               use sort 'defaults';        # revert to default behavior
               no  sort 'stable';          # stability not important
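    A hypothetical usage example (my variable names; note that the pragma is lexically scoped, so the request only affects sorts compiled inside the block):

    use strict;
    use warnings;

    my @a = (1, 4, 9);
    my @b = (2, 3, 10);
    my @c = (5, 6, 7, 8);

    my @merged = do {
        use sort '_mergesort';             # ask for the mergesort implementation here
        sort { $a <=> $b } @a, @b, @c;
    };
    print "@merged\n";                     # 1 2 3 4 5 6 7 8 9 10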
    
    
    
Re^4: better union of sets algorithm?
by Anonymous Monk on Mar 11, 2005 at 10:49 UTC
    I'm not convinced. Your numbers don't look like O(n lg n) to me, certainly not your insertion numbers. But your benchmark has some flaws as well. First, it uses wall-clock time, so anything else running on the machine skews the results. Second, the key size does have some influence on the insert/query times: the average key length for keys 1 .. 100_000 is 4.89, but for keys 1 .. 3_200_000 the average key length is 6.65.

    Here's a benchmark that uses the same key length throughout, and that uses times (CPU time) instead of time. I also stopped after 1_500_000 keys, to avoid having swapping skew the results.

    my @sizes = map { $_ * 100_000 } 1 .. 15;

    foreach my $size (@sizes) {
        my %h;
        my $from = sprintf "%010d", 1;
        my $to   = sprintf "%010d", $size;

        my @start = times;
        foreach my $k ($from .. $to) {
            $h{$k} = $k;
        }
        my @end   = times;
        my $diff  = ($end[0] + $end[1]) - ($start[0] + $start[1]);
        my $ratio = $size / $diff;
        printf "%10d: %.3f inserts/sec;", $size, $ratio;

        @start = times;
        foreach my $k ($from .. $to) {
            $k == $h{$k} or die;
        }
        @end   = times;
        $diff  = ($end[0] + $end[1]) - ($start[0] + $start[1]);
        $ratio = $size / $diff;
        printf " %.3f queries/sec\n", $ratio;
    }
    __END__
        100000: 357142.857 inserts/sec; 625000.000 queries/sec
        200000: 392156.863 inserts/sec; 526315.789 queries/sec
        300000: 291262.136 inserts/sec; 434782.609 queries/sec
        400000: 476190.476 inserts/sec; 459770.115 queries/sec
        500000: 458715.596 inserts/sec; 442477.876 queries/sec
        600000: 337078.652 inserts/sec; 447761.194 queries/sec
        700000: 457516.340 inserts/sec; 429447.853 queries/sec
        800000: 441988.950 inserts/sec; 425531.915 queries/sec
        900000: 434782.609 inserts/sec; 410958.904 queries/sec
       1000000: 427350.427 inserts/sec; 404858.300 queries/sec
       1100000: 317002.882 inserts/sec; 416666.667 queries/sec
       1200000: 433212.996 inserts/sec; 408163.265 queries/sec
       1300000: 420711.974 inserts/sec; 398773.006 queries/sec
       1400000: 425531.915 inserts/sec; 402298.851 queries/sec
       1500000: 394736.842 inserts/sec; 364963.504 queries/sec
    Not something I'd call O(n lg n).