in reply to better union of sets algorithm?

If they're sorted in advance and you have a constant number of lists, you should be able to do it in linear time (instead of the n lg n that using a hash provides). The basic technique is to think of the lists as queues. Look at the head of each queue and find the lowest item. Print it. If there are any identical items, they will also be at the queue heads; remove all of them. Now find the next lowest item, and repeat.

Update: This assumes you can get the lists sorted for free, or at least for cheaper than the n lg n it would take to process the hashes. If that's the case, here's some simple code that demonstrates it, using an object to keep the idea clear:

#!/usr/bin/perl
use warnings;
use strict;

my @a1 = sort qw(foo bar blarn schmark floogle foo blarn);
my @a2 = sort qw(flirp schnirp blarn floogle florn flimple flange);
my @a3 = sort qw(bar bar floogle bar florn bar bar);

my $sml = SortedMultiList->new(\@a1, \@a2, \@a3);
my $last;
while (defined(my $next = $sml->shift_lowest)) {
    # Identical items always come out back to back, so skipping
    # repeats of the previous value removes duplicates.
    print $next, "\n" if (!defined($last) or ($last ne $next));
    $last = $next;
}

package SortedMultiList;

sub new {
    my $class = shift;
    my $self  = [@_];    # just the list of sorted array refs
    bless $self, $class;
}

# Remove and return the smallest head item across all the lists.
sub shift_lowest {
    my $self = shift;
    my ($lowest_arr, $lowest);
    foreach my $a (@$self) {
        next unless (@$a);
        if (!defined($lowest) or ($a->[0] lt $lowest)) {
            $lowest_arr = $a;
            $lowest     = $a->[0];
        }
    }
    return $lowest_arr ? shift(@$lowest_arr) : undef;
}

Re^2: better union of sets algorithm?
by blokhead (Monsignor) on Mar 11, 2005 at 06:18 UTC
    You can do slightly better than what you just suggested for a mergesort of k lists (when k is not super-small). If you have k sorted lists, pop the first item off each and put these items into a heap, with some indication of which list they came from. Take the lowest item off the heap; if it's not a duplicate of the item we last saw, put it into the output list. Pop the next item off the list it came from and put that into the heap. Repeat.

    This gives you an O(n log k) mergesort of k lists (where n is the total number of elements in all the lists). Blindly searching through the first elements of each list as you suggest gives an O(nk) algorithm. Of course, when the number of lists is constant, they are both still O(n), but the heap version will have a lower constant (unless k is very small).
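
    To make the heap version concrete, here is a minimal sketch in plain Perl. It is my own illustration, not code from this thread: core Perl ships no heap module, so it hand-rolls a tiny array-based min-heap, and the name merge_unique and the $sift_up/$sift_down helpers are invented for the example.

    use strict;
    use warnings;

    # Merge k sorted lists of strings into one sorted, de-duplicated list
    # in O(n log k), where n is the total number of elements.
    sub merge_unique {
        my @lists = @_;              # array refs, each already sorted
        my @heap;                    # min-heap of [value, list index]

        my $sift_up = sub {
            my $i = shift;
            while ($i > 0) {
                my $p = int(($i - 1) / 2);
                last if $heap[$p][0] le $heap[$i][0];
                @heap[$p, $i] = @heap[$i, $p];
                $i = $p;
            }
        };
        my $sift_down = sub {
            my $i = 0;
            while (1) {
                my ($l, $r) = (2 * $i + 1, 2 * $i + 2);
                my $s = $i;
                $s = $l if $l <= $#heap && $heap[$l][0] lt $heap[$s][0];
                $s = $r if $r <= $#heap && $heap[$r][0] lt $heap[$s][0];
                last if $s == $i;
                @heap[$s, $i] = @heap[$i, $s];
                $i = $s;
            }
        };

        # Seed the heap with the head of each non-empty list.
        for my $k (0 .. $#lists) {
            next unless @{ $lists[$k] };
            push @heap, [ shift @{ $lists[$k] }, $k ];
            $sift_up->($#heap);
        }

        my (@out, $last);
        while (@heap) {
            my ($val, $k) = @{ $heap[0] };
            if (@{ $lists[$k] }) {
                # Replace the root with the next item from the same list.
                $heap[0] = [ shift @{ $lists[$k] }, $k ];
            }
            else {
                # That list is exhausted; shrink the heap instead.
                my $tail = pop @heap;
                $heap[0] = $tail if @heap;
            }
            $sift_down->() if @heap;
            push @out, $val if (!defined($last) or ($val ne $last));
            $last = $val;
        }
        return @out;
    }

    my @a1 = sort qw(foo bar blarn schmark floogle foo blarn);
    my @a2 = sort qw(flirp schnirp blarn floogle florn flimple flange);
    my @a3 = sort qw(bar bar floogle bar florn bar bar);
    print "$_\n" for merge_unique(\@a1, \@a2, \@a3);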

    What I fail to understand is why you say that perrin's hash algorithm takes O(n log n) time. No, a hash dereferencing operation takes amortized constant time, and you are just doing a dereference for each element in each list. That's a linear time algorithm.

    I'm afraid perrin won't be able to do better than the linear-time algorithm he presents (other than small tweaks to the constant, which he explicitly expressed disinterest in). Think about it: how could you remove duplicates in the general case without at least looking at each element? Yes, there are surely some obscene special cases where you might do better, but it's a safe bet that you're going to copy the result into an array somewhere, which takes O(n) anyway...

    blokhead

      You can do slightly better than what you just suggested for a mergesort of k lists (when k is not super-small).
      Yes, but it's much harder to code off the top of my head. :)

      What I fail to understand is why you say that perrin's hash algorithm takes O(n log n) time. No, a hash dereferencing operation takes amortized constant time, and you are just doing a dereference for each element in each list. That's a linear time algorithm.
      My understanding has always been that hash operations over n items take lg n time, since in general a hash will divide up n items into n / lg n buckets, then do a linear search through the appropriate bucket.

      Some experiments seem to bear this out: as I increase the number of elements in a hash, the number of operations per second I can do decreases. This wouldn't happen if hash operations took constant time.

       100000: 446368.754 inserts/sec, 1050330.051 lookups/sec
       200000: 424977.633 inserts/sec,  821176.759 lookups/sec
       400000: 398651.394 inserts/sec,  805369.896 lookups/sec
       800000: 399463.754 inserts/sec,  787853.768 lookups/sec
      1600000: 390381.775 inserts/sec,  766870.015 lookups/sec
      3200000: 377860.715 inserts/sec,  720909.739 lookups/sec

      Then again, my recollections of algorithmic amortizations are hazy at best, so feel free to convince me that I'm wrong. :)
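
      (The script that produced the numbers above isn't shown; purely as an illustration of the kind of measurement involved, a wall-clock benchmark shaped roughly like the following would give comparable output. The structure is my own guess, not the actual code.)

          use strict;
          use warnings;
          use Time::HiRes qw(time);

          # Hypothetical reconstruction: insert $size keys, then look them
          # all up again, timing each phase with wall-clock time.
          foreach my $size (map { $_ * 100_000 } 1, 2, 4, 8, 16, 32) {
              my %h;

              my $t0 = time;
              $h{$_} = 1 for 1 .. $size;
              my $t1 = time;

              my $sum = 0;
              $sum += $h{$_} for 1 .. $size;
              my $t2 = time;

              printf "%d: %.3f inserts/sec, %.3f lookups/sec\n",
                  $size, $size / ($t1 - $t0), $size / ($t2 - $t1);
          }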

        IMO what you are seeing there is NOT lookup time, but rather effects such as spoiled caches and hash remapping. Every time the density of the hash goes above a certain level, the number of buckets in the hash is doubled and all of the elements are remapped. This has a cost that shows up in the type of benchmark you are doing; you would see most of this effect disappear by presizing the hash before loading it. The slowdown that comes from spoiling the cache is basically impossible to eliminate: once the hash gets over a certain size, you are more or less guaranteed that a hash lookup will spoil the cache, eliminating the benefits you would get if you were able to store the whole hash inside the cache.
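
        (To make the presizing point concrete: assigning to keys() on a hash preallocates buckets, so the repeated doubling-and-remapping drops out of the load. A small sketch, with a size made up for illustration:)

            use strict;
            use warnings;

            my $n = 1_000_000;

            my %presized;
            keys(%presized) = $n;             # allocate buckets up front
            $presized{$_} = 1 for 1 .. $n;    # loads without any remapping

            my %grown;
            $grown{$_} = 1 for 1 .. $n;       # buckets double repeatedly as it fills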

        Also, it's worth remembering that every time you use a unique key in _any_ hash _anywhere_, that key is also inserted into a second, shared hash structure that is used to store the keys. This means that the second hash must undergo scaling as well. AFAIK, the amortized cost of hash storage or lookup is actually constant. BTW, I believe later perls incorporate a mechanism by which degenerate bucket mappings are avoided; from what I understand it's unusual to get more than two elements in a bucket, and more commonly you won't see more than one.

        ---
        demerphq

        Yes, this looks very much like a logarithmic increase, doesn't it? And it jibes with what you say about using n/log(n) buckets. If there are n items in the hash and m buckets, then the amortized cost of any hash operation is O(1 + n/m). If Perl is indeed using n/log(n) buckets, that makes all hash operations logarithmic.
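
        To spell out the arithmetic, with $n$ keys spread over $m$ buckets and the usual uniform-hashing assumption:

        \[
          O\!\left(1 + \frac{n}{m}\right) =
          \begin{cases}
            O(\log n) & \text{if } m = n/\log n,\\
            O(1)      & \text{if } m = cn \text{ for a constant } c,
          \end{cases}
        \]

        so which of the two claims holds comes down to which bucket-scaling policy Perl actually uses.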

        I'm pretty sure this implies that if you scale the number of buckets proportional to n (instead of n/log(n)), you will get constant-time operations. But perhaps something yucky happens when you try to implement such a creature. Or perhaps I am confusing hashes with other amortized structures (where I know for a fact that certain operations are constant-time). In either case, the empirical evidence suggests I should retract my assertion about constant amortized hashes in Perl. Is there an internals geek around who can shed light on the actual implementation used?

        blokhead

        You can do slightly better than what you just suggested for a mergesort of k lists (when k is not super-small).
        Yes, but it's much harder to code off the top of my head. :)
        From perldoc sort:
                   use sort 'stable';          # guarantee stability
                   use sort '_quicksort';      # use a quicksort algorithm
                   use sort '_mergesort';      # use a mergesort algorithm
                   use sort 'defaults';        # revert to default behavior
                   no  sort 'stable';          # stability not important
        
        
        
        I'm not convinced. Your numbers don't look like O(n lg n) to me, certainly not the insertion numbers. But your benchmark has some flaws as well. First, it uses wall-clock times, so anything else running on the machine skews the results. Second, the key size has some influence on the insert/query times: the average key length for keys 1 .. 100_000 is 4.89, but for keys 1 .. 3_200_000 it is 6.65.

        Here's a benchmark that uses keys of identical length, and that uses times() instead of wall-clock time. I also stopped at 1_500_000 keys, to avoid having swapping skew the results.

        my @sizes = map { $_ * 100_000 } 1 .. 15;

        foreach my $size (@sizes) {
            my %h;
            my $from  = sprintf "%010d", 1;
            my $to    = sprintf "%010d", $size;
            my @start = times;
            foreach my $k ($from .. $to) {
                $h{$k} = $k;
            }
            my @end   = times;
            my $diff  = ($end[0] + $end[1]) - ($start[0] + $start[1]);
            my $ratio = $size / $diff;
            printf "%10d: %.3f inserts/sec;", $size, $ratio;

            @start = times;
            foreach my $k ($from .. $to) {
                $k == $h{$k} or die;
            }
            @end   = times;
            $diff  = ($end[0] + $end[1]) - ($start[0] + $start[1]);
            $ratio = $size / $diff;
            printf " %.3f queries/sec\n", $ratio;
        }
        __END__
            100000: 357142.857 inserts/sec; 625000.000 queries/sec
            200000: 392156.863 inserts/sec; 526315.789 queries/sec
            300000: 291262.136 inserts/sec; 434782.609 queries/sec
            400000: 476190.476 inserts/sec; 459770.115 queries/sec
            500000: 458715.596 inserts/sec; 442477.876 queries/sec
            600000: 337078.652 inserts/sec; 447761.194 queries/sec
            700000: 457516.340 inserts/sec; 429447.853 queries/sec
            800000: 441988.950 inserts/sec; 425531.915 queries/sec
            900000: 434782.609 inserts/sec; 410958.904 queries/sec
           1000000: 427350.427 inserts/sec; 404858.300 queries/sec
           1100000: 317002.882 inserts/sec; 416666.667 queries/sec
           1200000: 433212.996 inserts/sec; 408163.265 queries/sec
           1300000: 420711.974 inserts/sec; 398773.006 queries/sec
           1400000: 425531.915 inserts/sec; 402298.851 queries/sec
           1500000: 394736.842 inserts/sec; 364963.504 queries/sec
        Not something I'd call O(n lg n).
Re^2: better union of sets algorithm?
by Anonymous Monk on Mar 11, 2005 at 10:14 UTC
    instead of the n lg n that using a hash provides
    You are wrong here. The expected amortized time for a hash operation is O(k), where k is the length of the string (that's the time needed to compute the hash value). Note the absence of any dependency on the size of the hash. This means that if you do N hash operations (insertions, deletions, fetches) and all your strings are bounded in length by some constant, the expected total time is O(N).

    Now, an individual insertion into a hash may take O(n) time (where n is the size of the hash). But this typically happens only after Ω(n) insertions (because the hash fills up and a new, larger hash needs to be built), so the cost amortizes out.
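
    A quick way to see the amortization, under the usual assumption that the table doubles each time it fills: over $N$ insertions the total rebuild work is

    \[
      \sum_{i=0}^{\lfloor \log_2 N \rfloor} 2^i \;<\; 2N,
    \]

    i.e. only $O(1)$ extra work per insertion on average.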

    An individual search may take Ω(n) time if a constant fraction of your keys hash to the same bucket; that would lead to worst-case quadratic performance. But the chances of that happening are so small that it has no impact on the expected running time. And ever since 5.8.1, Perl has a safety mechanism (turned on by default) that prevents you from even deliberately constructing an input sequence that hashes all the keys to the same bucket, because the hash function is seeded with a different value on each run.

    So, for all practical purposes, hash operations are done in constant time.