in reply to Re: better union of sets algorithm?
in thread better union of sets algorithm?

You can do slightly better than what you just suggested for a mergesort of k lists (when k is not super-small). If you have k sorted lists, pop the first item off each and put them into a heap, with some indication of which list they came from. Take the lowest item off the heap; if it's not a duplicate of the item we last saw, put it into the output list. Then pop the next item off the list it came from and put that into the heap. Repeat.

This gives you an O(n log k) mergesort of k lists (where n is the total number of elements in all the lists). Blindly searching through the first elements of each list as you suggest gives an O(nk) algorithm. Of course, when the number of lists is constant, they are both still O(n), but the heap version will have a lower constant (unless k is very small).
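
Here's a minimal sketch of that merge in Perl, with a hand-rolled binary min-heap so it stays self-contained (any CPAN heap module would do as well). merge_union, heap_push, and heap_pop are names I made up for the sketch, and it assumes numeric elements:

    use strict;
    use warnings;

    # Min-heap of [value, list-index] pairs, keyed on value.
    sub heap_push {
        my ($heap, $item) = @_;
        push @$heap, $item;
        my $i = $#$heap;
        while ($i > 0) {
            my $parent = int(($i - 1) / 2);
            last if $heap->[$parent][0] <= $heap->[$i][0];
            @$heap[$parent, $i] = @$heap[$i, $parent];
            $i = $parent;
        }
    }

    sub heap_pop {
        my ($heap) = @_;
        my $top  = $heap->[0];
        my $last = pop @$heap;
        if (@$heap) {
            $heap->[0] = $last;
            my $i = 0;
            while (1) {
                my ($l, $r) = (2 * $i + 1, 2 * $i + 2);
                my $small = $i;
                $small = $l if $l < @$heap && $heap->[$l][0] < $heap->[$small][0];
                $small = $r if $r < @$heap && $heap->[$r][0] < $heap->[$small][0];
                last if $small == $i;
                @$heap[$small, $i] = @$heap[$i, $small];
                $i = $small;
            }
        }
        return $top;
    }

    # Union of k sorted lists (array refs) in O(n log k), duplicates dropped.
    sub merge_union {
        my @lists = @_;
        my (@heap, @pos, @out);
        for my $i (0 .. $#lists) {
            next unless @{ $lists[$i] };
            heap_push(\@heap, [ $lists[$i][0], $i ]);   # seed with first item of each list
            $pos[$i] = 1;
        }
        while (@heap) {
            my ($val, $i) = @{ heap_pop(\@heap) };
            push @out, $val unless @out && $out[-1] == $val;   # skip duplicates
            heap_push(\@heap, [ $lists[$i][ $pos[$i]++ ], $i ])
                if $pos[$i] < @{ $lists[$i] };
        }
        return \@out;
    }

    print "@{ merge_union([1,3,5], [2,3,6], [1,4,6]) }\n";   # 1 2 3 4 5 6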

What I fail to understand is why you say that perrin's hash algorithm takes O(n log n) time. No, a hash dereferencing operation takes amortized constant time, and you are just doing a dereference for each element in each list. That's a linear time algorithm.
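
In Perl terms, that hash-based union is essentially this sort of thing (a sketch assuming plain scalar elements; hash_union is just my name for it):

    sub hash_union {
        my %seen;
        # one amortized O(1) hash store/test per element => O(n) overall
        return grep { !$seen{$_}++ } map { @$_ } @_;
    }

    my @union = hash_union([1,3,5], [2,3,6], [1,4,6]);   # 1 3 5 2 6 4

Note that, unlike the merge, the output comes back in first-seen order rather than sorted.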

I'm afraid perrin won't be able to do better than the linear-time algorithm he presents (other than small tweaks to the constant, which he explicitly expressed disinterest in). Think about it: how could you remove duplicates in the general case without at least looking at each element? Yes, there are surely some obscene special cases where you might do better, but it's a safe bet that you're going to copy the result into an array somewhere, which takes O(n) anyway...

blokhead

Re^3: better union of sets algorithm?
by sgifford (Prior) on Mar 11, 2005 at 06:50 UTC
    You can do slightly better than what you just suggested for a mergesort of k lists (when k is not super-small).
    Yes, but it's much harder to code off the top of my head. :)

    What I fail to understand is why you say that perrin's hash algorithm takes O(n log n) time. No, a hash dereferencing operation takes amortized constant time, and you are just doing a dereference for each element in each list. That's a linear time algorithm.
    My understanding has always been that hash operations over n items take lg n time, since in general a hash will divide up n items into n / lg n buckets, then do a linear search through the appropriate bucket.

    Some experiments seem to bear this out: as I increase the number of elements in a hash, the number of operations per second I can do decreases. This wouldn't happen if hash operations took constant time.

     100000: 446368.754 inserts/sec, 1050330.051 lookups/sec
     200000: 424977.633 inserts/sec,  821176.759 lookups/sec
     400000: 398651.394 inserts/sec,  805369.896 lookups/sec
     800000: 399463.754 inserts/sec,  787853.768 lookups/sec
    1600000: 390381.775 inserts/sec,  766870.015 lookups/sec
    3200000: 377860.715 inserts/sec,  720909.739 lookups/sec
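
    (The original script wasn't posted; a hypothetical reconstruction of this kind of wall-clock benchmark would look something like this:)

        use strict;
        use warnings;
        use Time::HiRes qw(time);              # wall-clock timing

        for my $size (map { 100_000 * 2**$_ } 0 .. 5) {
            my %h;
            my $t0 = time;
            $h{$_} = 1 for 1 .. $size;         # inserts
            my $t1 = time;
            for (1 .. $size) { defined $h{$_} or die }   # lookups
            my $t2 = time;
            printf "%8d: %.3f inserts/sec, %.3f lookups/sec\n",
                $size, $size / ($t1 - $t0), $size / ($t2 - $t1);
        }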

    Then again, my recollections of algorithmic amortizations are hazy at best, so feel free to convince me that I'm wrong. :)

      IMO what you are seeing there is NOT lookup time, but rather effects such as cache spoilage and hash remapping. Every time the density of the hash goes above a certain level, the number of buckets in the hash is doubled and all of the elements are remapped. This has a cost that shows up in the type of benchmark you are doing; you would see most of it disappear by presizing the hash before loading it (as shown below). The slowdown that comes from cache spoilage, though, is basically impossible to eliminate: once the hash gets over a certain size you are more or less guaranteed that a hash lookup will spoil the cache, eliminating the benefits you would get if the whole hash fit inside it.
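
      A two-line illustration of that presizing (assigning to keys(%h) really does preallocate buckets; the count here is just illustrative):

          my %h;
          keys(%h) = 3_200_000;   # preallocate buckets: no doubling/remapping during the load
          $h{$_} = 1 for 1 .. 3_200_000;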

      Also it's worth remembering that every time you use a unique key in _any_ hash _anywhere_, that key will be inserted into a second hash structure that is used to store the keys. This means that the second hash must undergo scaling as well. AFAIK, the amortized cost of hash storage or lookup is actually constant. BTW, I believe later perls incorporate a mechanism by which degenerate bucket mappings are avoided; from what I understand it's unusual to get more than two elements in a bucket, and more commonly you won't see more than one.

      ---
      demerphq

        Also its worth remembering that every time you use a unique key in _any_ hash _anywhere_ that key will be inserted into a second hash structure that is used to store the keys.

        No, buckets are linked lists, not secondary hashes.

      Yes, this looks very much like a logarithmic increase, doesn't it? And it jibes with what you say about using n/log(n) buckets. If there are n items in the hash and m buckets, then the amortized cost of any hash operation is O(1+n/m). If indeed Perl is using m = n/log(n) buckets, then n/m = log(n), which makes all hash operations logarithmic.

      I'm pretty sure this implies that if you scale the number of buckets proportional to n (instead of n/log(n)), you will get constant-time operations. But perhaps something yucky happens when you try to implement such a creature. Or perhaps I am confusing hashes with other amortized structures (where I know for a fact that certain operations are constant-time). In either case, the empirical evidence suggests I should retract my assertion about constant amortized hashes in Perl. Is there an internals geek around who can shed light on the actual implementation used?

      blokhead

      You can do slightly better than what you just suggested for a mergesort of k lists (when k is not super-small).
      Yes, but it's much harder to code off the top of my head. :)
      From perldoc sort:
                 use sort 'stable';          # guarantee stability
                 use sort '_quicksort';      # use a quicksort algorithm
                 use sort '_mergesort';      # use a mergesort algorithm
                 use sort 'defaults';        # revert to default behavior
                 no  sort 'stable';          # stability not important
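
      Presumably the point is to let Perl's built-in mergesort do the heavy lifting instead of hand-coding the k-way merge. A sketch, assuming numeric elements in array refs held in @lists (my name):

          use sort '_mergesort';
          my %seen;
          my @union = grep { !$seen{$_}++ }
                      sort { $a <=> $b } map { @$_ } @lists;

      That's O(n log n) rather than O(n log k), but it's hard to beat for brevity.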
      
      I'm not convinced. Your numbers don't look like O(n lg n) to me, certainly not your insertion numbers. But your benchmark has some flaws as well. First, it uses wall-clock time, so anything else running on the machine skews the results. Second, the key size has some influence on the insert/query times: the average length of the keys 1 .. 100_000 is 4.89 characters, but for the keys 1 .. 3_200_000 it is 6.65.

      Here's a benchmark that uses keys of a uniform length, and that uses times instead of time. I also stopped after 1_500_000 keys, to avoid having swapping skew the results.

      my @sizes = map { $_ * 100_000 } 1 .. 15;
      foreach my $size (@sizes) {
          my %h;
          my $from = sprintf "%010d", 1;
          my $to   = sprintf "%010d", $size;
          my @start = times;
          foreach my $k ($from .. $to) {
              $h{$k} = $k;
          }
          my @end  = times;
          my $diff = ($end[0] + $end[1]) - ($start[0] + $start[1]);
          my $ratio = $size / $diff;
          printf "%10d: %.3f inserts/sec;", $size, $ratio;
          @start = times;
          foreach my $k ($from .. $to) {
              $k == $h{$k} or die;
          }
          @end   = times;
          $diff  = ($end[0] + $end[1]) - ($start[0] + $start[1]);
          $ratio = $size / $diff;
          printf " %.3f queries/sec\n", $ratio;
      }
      __END__
          100000: 357142.857 inserts/sec; 625000.000 queries/sec
          200000: 392156.863 inserts/sec; 526315.789 queries/sec
          300000: 291262.136 inserts/sec; 434782.609 queries/sec
          400000: 476190.476 inserts/sec; 459770.115 queries/sec
          500000: 458715.596 inserts/sec; 442477.876 queries/sec
          600000: 337078.652 inserts/sec; 447761.194 queries/sec
          700000: 457516.340 inserts/sec; 429447.853 queries/sec
          800000: 441988.950 inserts/sec; 425531.915 queries/sec
          900000: 434782.609 inserts/sec; 410958.904 queries/sec
         1000000: 427350.427 inserts/sec; 404858.300 queries/sec
         1100000: 317002.882 inserts/sec; 416666.667 queries/sec
         1200000: 433212.996 inserts/sec; 408163.265 queries/sec
         1300000: 420711.974 inserts/sec; 398773.006 queries/sec
         1400000: 425531.915 inserts/sec; 402298.851 queries/sec
         1500000: 394736.842 inserts/sec; 364963.504 queries/sec
      Not something I'd call O(n lg n).