in reply to Re: Contemplating some set comparison tasks
in thread Contemplating some set comparison tasks

Generate your sample data!

#! /usr/bin/perl
open(RND, '<', '/dev/urandom') or die;            # raw random bytes
sub id { local $/ = \20; unpack 'H40', <RND> }    # 20 random bytes as a 40-char hex ID
sub nk { int 1000**(1+rand 42)**-1 }              # skewed repeat count in 1 .. 1000, biased towards 1
my @S = map id, 1 .. 20_000;                      # fixed pool of 20,000 source IDs
do { print "$_|$S[rand@S]\n" for (id) x nk } for 1 .. 30_000_000;   # each key printed nk times, each with a random source
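
Each output line is a key|source pair of 40-character hex IDs: sources are drawn from the fixed pool of 20,000, and each key is repeated a skewed number of times (mostly just once, occasionally up to a thousand). Run it along the lines of perl gen_sample.pl > sample.txt (the filename is just an example), and note that 30,000,000 outer iterations produce a very large file.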

Re^3: Contemplating some set comparison tasks
by Laurent_R (Canon) on Aug 10, 2014 at 09:14 UTC
    Thanks for the suggestion. I had thought about it, but I quickly dismissed the idea because the efficiency of the suggested algorithm depends to a very large extent on the shape of the data. Randomly generated data might behave completely differently from the actual data, and we don't even know whether it would be for the better or for the worse.

    For example, the standard quicksort algorithm behaves well on randomly generated data (with an average O(n log n) complexity) but has a bad worst-case complexity (O(n²)); the textbook variant that picks the first (or last) element as pivot behaves particularly poorly on almost-sorted data. Heap sort and merge sort behave reasonably well in all cases (O(n log n)). Other algorithms such as bubble sort or insertion sort perform poorly on random data (O(n²)) but exhibit an extremely good best case (O(n)) on almost-sorted input.
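
    To make the quicksort point concrete, here is a minimal sketch (mine, not from the thread, using only core Perl and the core Benchmark module): a naive quicksort that always takes the first element as pivot, so that sorted input forces maximally unbalanced partitions.

        use strict;
        use warnings;
        use Benchmark qw(timethese);

        # Naive quicksort: first element as pivot. On already sorted
        # input every partition is empty on one side, so the recursion
        # degenerates to O(n^2) comparisons (and depth ~ n).
        sub naive_qsort {
            no warnings 'recursion';    # depth ~ n on sorted input
            my @a = @_;
            return @a if @a <= 1;
            my $pivot = shift @a;
            my (@lo, @hi);
            push @{ $_ < $pivot ? \@lo : \@hi }, $_ for @a;
            return (naive_qsort(@lo), $pivot, naive_qsort(@hi));
        }

        my @random = map { int rand 2_000 } 1 .. 2_000;
        my @sorted = 1 .. 2_000;

        # The 'sorted' case needs ~n^2/2 comparisons, 'random' ~n log n;
        # the gap between the two timings grows quadratically with n.
        timethese(10, {
            random => sub { my @r = naive_qsort(@random) },
            sorted => sub { my @r = naive_qsort(@sorted) },
        });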

    With the actual problem described by dwhite20899, I would probably first run the initial elimination phase I suggested (removing all singletons, i.e. keys that appear only once, plus all the keys having the same source as those singletons, plus possibly duplicates of the keys eliminated because of their source), and then look closely at the remaining data to figure out the best way to go. Depending on the actual shape of the data, there may still be 6.2 million lines at that point (if the elimination through the source did not remove any additional keys), or maybe only 500k, or maybe far less. And the remaining data may contain many new singletons, making the second part of the algorithm efficient, or it may not.
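
    For concreteness, a rough sketch of that first elimination pass (my own, a single round only, assuming key|source lines like those produced by the generator above; holding 30 million lines in hashes would take real memory, so treat this as the shape of the pass, not a tuned implementation):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Pass 1: count keys and remember which sources each key touches.
        my (%count, %sources_of);
        while (my $line = <>) {
            chomp $line;
            my ($key, $src) = split /\|/, $line;
            $count{$key}++;
            push @{ $sources_of{$key} }, $src;
        }

        # Mark the sources that belong to a singleton key
        # (a key seen exactly once).
        my %tainted;
        for my $key (keys %count) {
            $tainted{ $sources_of{$key}[0] } = 1 if $count{$key} == 1;
        }

        # Keep only keys that are not singletons and touch no tainted
        # source; a further round could re-check for new singletons
        # among the survivors.
        for my $key (keys %count) {
            next if $count{$key} == 1;
            next if grep { $tainted{$_} } @{ $sources_of{$key} };
            print "$key|$_\n" for @{ $sources_of{$key} };
        }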

    In brief, using randomly generated data can't tell us anything about the performance of the suggested algorithm on the real data.