Do you have an algorithm in mind?
To quote myself above quoting myself above:
I'd expect a modified sorting algorithm that eliminates duplicates whenever
they "touch".
Most sorting algorithms boil down to steps of "compare two records" followed
by "'move' one of the records". If two records compare equal, don't move one
of them, remove one of them. The details of what "remove" means depends on
the underlying sort algorithm and data structures being used. Exercise your
basic programing skills for each case.
For an insertion sort, "remove" just means "don't insert".
For quicksort, you add a third type of swap that moves a duplicate pivot to
the end of the current parition and you output a fully sorted (and possibly
shortened) partition as soon as you finish with it.
For in-memory, in-place merge sort: When comparing two items, if they are
equal, just decrement the current end pointer, leaving a "don't care" item on
the end of that partition. When merging, just skip duplicates and "don't
care"s, which leaves a buffer of "don't care" items on the end of the larger
partition. When done, the sorted duplicates are at the start of the list.
The added complexity is that sorting can change the "end" of a partition so the
recursive function must return a tiny bit of information back up the call stack.
So, let's throw together a quick implementation of merge sort:
sub mergesort {
my( $av, $beg, $end )= @_;
if( $beg < $end-1 ) {
my $mid= int( ($beg+$end)/2 );
mergesort( $av, $beg, $mid );
mergesort( $av, $mid+1, $end );
merge( $av, $beg, $mid, $mid+1, $end );
} elsif( $beg == $end-1 ) {
my $cmp= compare( $av->[$beg], $av->[$end] );
if( -1 == $cmp ) {
@$av[$beg,$end]= @$av[$end,$beg];
}
}
}
Now let's teach it to eliminate duplicates as it goes, highlighting what we
had to add:
sub uniqmergesort {
my( $av, $beg, $end )= @_;
if( $beg < $end-1 ) {
my $mid= int( ($beg+$end)/2 );
my $d1= uniqmergesort( $av, $beg, $mid );
#^^^^^^
my $d2= uniqmergesort( $av, $mid+1, $end );
#^^^^^^
$d1 += uniqmerge( $av, $beg, $mid-$d1, $mid+1, $end-$d2 );
#^^^^^ ^^^^ ^^^^ ^^^^
return $d1+$d2;
#^^^^^^^^^^^^^^
}
if( $beg == $end-1 ) {
my $cmp= compare( $av->[$beg], $av->[$end] );
if( -1 == $cmp ) {
@$av[$beg,$end]= @$av[$end,$beg];
return 0;
#^^^^^^^^
}
return 1 if 0 == $cmp;
#^^^^^^^^^^^^^^^^^^^^^^^
}
return 0;
#^^^^^^^^
}
Just for completeness, here are the implementations of merge() and uniqmerge(),
showing similarly minimal changes (not highlighted):
sub merge {
my( $av, $a, $b, $c, $d )= @_;
my $beg= $a;
my @i;
while( $a <= $b && $c <= $d ) {
my $cmp= compare( $av->[$a], $av->[$c] );
push @i, $a++
if -1 != $cmp;
push @i, $c++
if 1 != $cmp;
}
@$av[$beg..$d]= @$av[ @i, $a..$b, $c..$d ];
}
sub uniqmerge {
my( $av, $a, $b, $c, $d )= @_;
my $beg= $a;
my $end= $d;
my @i;
while( $a <= $b && $c <= $d ) {
my $cmp= compare( $av->[$a], $av->[$c] );
push @i, $a++
if -1 != $cmp;
push @i, $c++
if -1 == $cmp;
$c++, $end--
if 0 == $cmp;
}
@$av[$beg..$end]= @$av[ @i, $a..$b, $c..$d ];
return $d-$end;
}
Back to quicksort, in the middle of sorting you have a list in memory that
looks like the following line and each step involves doing a comparison and
then one of three different swaps shown (each swap also must increment or
decrement a variable or two that points to the boundaries between the
different classes of items in the current partition):
(...done...)= items already sorted and (less dups) output
p= the pivot or a duplicate of it
b= "before", things known to be "less than" the pivot
a= "after", things known to be "greater than" the pivot
u= unsorted items (but fit within the current partition)
n= next item to compare to our pivot
(...to-do...)= the partitions to sort after we output this one
(...done...) b b b b b u u u u u n p a a a a a p p p (...to-do...)
if n is a "b", swap: b<------->u
if n is an "a", swap: p-a
if n is a "p", swap: p a<------->p
Eventually yielding:
(...done...) b b b b b b b b p a a a a a a a p p p p (...to-do...)
|---sort next---|
Eventually your partitions get small enough that you have something like:
(...done...) b p a p p (...to-do...)
|write|
And you can output "b p a" and move on to the "to-do" items.
(Of course, an external merge sort like what /usr/bin/sort does then combines these in-memory-sorted lists via merge sorting via temporary files, discarding duplicates during the merge step as already touched on.)
|