in reply to Re^4: [OT] A measure of 'sortedness'?
in thread [OT] A measure of 'sortedness'?

Not sure -- it's been a long time since I've had anything detailed enough to warrant computational analysis of the execution cost (I miss the old days when I worked on fun stuff).

A side trip, though: if a full run-length index might have had value (it doesn't look like it here), here's a Perlized representation. Some of the code is sloppy, but I didn't go home like I said I would, so I optimized Wife-Deprivation Units at the cost of Code-I-Could-Be-Proud-Of Units.

Okay, so here's a proof of concept. The assumption is that a Run-Length table will consume notably less space than the original data, as might be true in a disk I/O operation. That is not a universally applicable assumption, but it may suggest optimizations for your specific case.

#!/usr/bin/perl
use strict;

# Constants (sort of)
my $MAIN_BUFFER_SIZE = 20;
my $POSSIBLE_VALUES  = 'ABCDE';
my $NUMBER_OF_VALUES = length $POSSIBLE_VALUES;    # Silly but deal with it

# Globals
my %RunLength_Info = ();
my @MainBuffer     = ();

# Generate random buffers with probable data overlaps, then scan, then merge
{
    foreach my $bufferPass (0..1) {
        my $workingBuffer = &generateMainBuffer();
        push @MainBuffer, $workingBuffer;
        &scanBuffer($bufferPass);
    }
    &dumpMainBuffer();
    &mergeBuffers();
}

exit;

sub generateMainBuffer {
    my @genBuffer = ();
    for (my $bufIndex = 0; $bufIndex < $MAIN_BUFFER_SIZE; $bufIndex++) {
        my $bufCharValue = int(rand($NUMBER_OF_VALUES));
        my $bufChar      = substr $POSSIBLE_VALUES, $bufCharValue, 1;
        push @genBuffer, $bufChar;
    }
    my @sortedBuffer    = sort @genBuffer;
    my $generatedBuffer = join '', @sortedBuffer;
    return $generatedBuffer;
}

sub scanBuffer {
    my ($bufferNumber, @garbage) = @_;
    if (!defined $bufferNumber) {
        die "Found a short-circuit between the ear buds";
    }
    for (my $bufIndex = 0; $bufIndex < $MAIN_BUFFER_SIZE; $bufIndex++) {
        my $currChar = substr $MainBuffer[$bufferNumber], $bufIndex, 1;
        if (!defined $RunLength_Info{$currChar}) {
            $RunLength_Info{$currChar} = 1;
        }
        else {
            $RunLength_Info{$currChar}++;
        }
    }
}

sub mergeBuffers {
    my $outBuffer = 0;
    my $outLength = 0;

    # Stupid but effective
    foreach my $bufferClean (0..1) {
        $MainBuffer[$bufferClean] = '';
    }

    # And the magic
    foreach my $finalChar (sort keys %RunLength_Info) {
        my $remCount = $RunLength_Info{$finalChar};
        while ($remCount) {
            if ($outLength < $MAIN_BUFFER_SIZE) {
                $MainBuffer[$outBuffer] .= $finalChar;
                $outLength++;
                $remCount--;
            }
            else {
                $outBuffer++;
                $outLength = 0;
            }
        }
    }
    &dumpMainBuffer();
}

sub dumpMainBuffer {
    print "Buffers:\n";
    foreach my $aBuffer (@MainBuffer) {
        print " [$aBuffer]\n";
    }
}

__END__

Results on my test run:

D:\PerlMonks>perl merge1.pl
Buffers:
 [AAAAAABBCCCDDDDEEEEE]
 [AAAABBBBBBCCCCCDDDDD]
Buffers:
 [AAAAAAAAAABBBBBBBBCC]
 [CCCCCCDDDDDDDDDEEEEE]

D:\PerlMonks>

Re^6: [OT] A measure of 'sortedness'?
by BrowserUk (Patriarch) on Mar 19, 2015 at 03:31 UTC

    With a typical minimum keysize of 64 bits, whilst repeats are not impossible, they are unlikely.

    But also, the same problem is true of my "equal to or greater by one" runs; they simply won't happen often enough to be of value.

    What's needed are runs of adjacent values in one buffer that don't appear in the other buffer; and vice versa.

    That means tracking both buffers concurrently -- which as I indicated in the OP is quite possible -- but it requires significantly more code and comparisons. Sufficiently more that it probably doesn't save anything when doing the merge :(
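
    A minimal sketch of the kind of concurrent tracking being discussed (the sub name and string-buffer representation are my assumptions for illustration, not code from the thread): walk both sorted buffers with two cursors and report each maximal run in the first buffer whose values all sort before the second buffer's next pending value -- exactly the runs that could be block-copied without interleaving.

    use strict;
    use warnings;

    # Find maximal runs in $bufA whose elements all sort before $bufB's
    # next pending element; each such run could be moved as one block.
    sub exclusive_runs_in_a {
        my ($bufA, $bufB) = @_;    # two sorted strings
        my ($i, $j, @runs) = (0, 0);
        while ($i < length $bufA) {
            # advance B's cursor past everything sorting before A's current
            $j++ while $j < length $bufB
                    && substr($bufB, $j, 1) lt substr($bufA, $i, 1);
            my $start = $i;
            # extend the run while A's element sorts before B's next one
            $i++ while $i < length $bufA
                    && ($j >= length $bufB
                        || substr($bufA, $i, 1) lt substr($bufB, $j, 1));
            push @runs, [ $start, $i - $start ] if $i > $start;
            # an empty run means a tie, so B must emit next; step past it
            $i++ if $i == $start;
        }
        return @runs;
    }

    # e.g. 'AACC' vs 'BB' yields runs at offset 0 (AA) and offset 2 (CC)
    printf "offset %d, length %d\n", @$_ for exclusive_runs_in_a('AACC', 'BB');

    Running the mirror-image scan for the other buffer at the same time is what doubles the comparison count and likely eats the savings, as noted above.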


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      I'm not ready to give up but it seems all the low-hanging fruit doesn't fit into the salad.

      I suspect you are correct, however, that without a new, inspired algorithmic approach to identifying optimization points in the merge, the platform constraints hamper conventional efforts to pre-select merge points (and other optimization efforts of that ilk).

      In short, not enough hardware, too much data variance.

      I'm not saying it can't be done. I'm saying no path to solution is clear to me at this point.

      A solution to this one will require either a profound application of theoretical mathematics or an invention -- likely one worthy of a patent.

      I think your bitmap idea still has merit, but employed slightly differently. It's not an algorithmic reduction, but it does reduce overhead by 50% (one buffer scan instead of two). Not going to win any awards, but a 50% overhead reduction isn't devoid of merit.

      If you build the bitmap during the mandatory pre-merge scans (this would require a parallel, non-lock-step scan as you have already proposed) and use it to merely indicate which source buffer the resulting buffer data will come from (essentially doing the merge decision during the pre-scan), then your merge pass doesn't have to think -- it just has to do.

      Buffers:
       [AABEFFHHKMOOQRRSSTUV]
       [ABCDEHIJKLLNPPRSVVWY]
      Bitmap:
       [0011011100000111101101001100011000001111]
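
      For what it's worth, that bitmap falls out of an ordinary two-cursor merge comparison during the pre-scan. A sketch (the sub name is mine, and ties may break differently than in the hand-built bitmap above while still describing a valid merge):

      use strict;
      use warnings;

      # Record which buffer each merged-output element would come from,
      # without moving any data: '0' = first buffer, '1' = second.
      sub build_source_bitmap {
          my ($buf0, $buf1) = @_;    # two sorted strings
          my ($i, $j, $bits) = (0, 0, '');
          while ($i < length $buf0 || $j < length $buf1) {
              if ($j >= length $buf1
                  || ($i < length $buf0
                      && substr($buf0, $i, 1) le substr($buf1, $j, 1))) {
                  $bits .= '0';  $i++;
              }
              else {
                  $bits .= '1';  $j++;
              }
          }
          return $bits;
      }

      print build_source_bitmap( 'AABEFFHHKMOOQRRSSTUV',
                                 'ABCDEHIJKLLNPPRSVVWY' ), "\n";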

      If you can cheaply convert the lengths of the alternating runs into integers
            <lightbulb>
            or just store them as (2- or 4-bit?) integers to begin with
            </lightbulb>

      Then you can just feed that into your block data move routine and achieve maximum blocking/minimum data move, all from metadata collected on that first mandatory linear pass.
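
      In Perl, at least, the bitrun-to-integer conversion is nearly free via a regex. A sketch; the 4-bit cap on run lengths is my reading of the "(2- or 4-bit?)" aside, and the zero-length-run trick for oversize runs is likewise an assumption:

      use strict;
      use warnings;

      # Convert the source bitmap into alternating run lengths (assumes the
      # bitmap starts with a buffer-0 run; prepend a 0 length otherwise).
      # Runs longer than the tinyint cap are split, with a zero-length run
      # inserted to keep the buffer-0/buffer-1 alternation intact.
      sub bitmap_to_runlengths {
          my ($bits, $cap) = @_;
          $cap //= 15;                          # 4-bit counter assumption
          my @runs;
          for my $run ($bits =~ /(0+|1+)/g) {   # maximal alternating runs
              my $len = length $run;
              while ($len > $cap) {
                  push @runs, $cap, 0;          # 0-length run keeps parity
                  $len -= $cap;
              }
              push @runs, $len;
          }
          return @runs;
      }

      # The bitmap above yields 2 2 1 3 5 4 1 2 1 1 2 2 3 2 5 4 -- the
      # "2213541211223254" runlength index shown in the OOL trace below.
      print join( ' ', bitmap_to_runlengths(
          '0011011100000111101101001100011000001111' ) ), "\n";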

      I would presume you'd use the rest of that 10% OOL buffer to perform the actual data migration.
            <lightbulb>
            If you performed the actual data migration in reverse, you could shrink the bitmap/runlength data as you use it, increasing the amount of linearly usable freespace in the buffer for the migration itself. You would naturally build the migrated data from the back of the OOL buffer and stop before stepping on your shrinking metadata. By the last pass you could be using almost the entire OOL buffer for data.
            </lightbulb>

      On each pass, the OOL buffer looks something like below. I say "something like" because, like a bad car mechanic, I wound up with spare parts at the end -- I clearly messed something up doing it by hand in a text editor LOL. (This is why I write programs to do the work for me. They don't miss. They don't get bored. They just do.)

      OOL Buffer:
        The front of the buffer contains the runlength index
        The back of the buffer contains the data to be migrated
        The dots represent freespace in the buffer

      [2213541211223254...........................................]
      [221354121122325........................................VVWY]
      [22135412112232....................................SSTUVVVWY]
      [2213541211223...................................RSSSTUVVVWY]
      [221354121122.................................QRRRSSSTUVVVWY]
      [22135412112................................PPQRRRSSSTUVVVWY]
      [2213541211...............................OOPPQRRRSSSTUVVVWY]
      [221354121...............................NOOPPQRRRSSSTUVVVWY]
      [22135412..............................LMNOOPPQRRRSSSTUVVVWY]
      [2213541.............................KLLMNOOPPQRRRSSSTUVVVWY]
      [221354.............................KKLLMNOOPPQRRRSSSTUVVVWY]
      [22135..........................HHIJKKLLMNOOPPQRRRSSSTUVVVWY]
      [2213......................EEFFHHHIJKKLLMNOOPPQRRRSSSTUVVVWY]
      [221....................BCDEEFFHHHIJKKLLMNOOPPQRRRSSSTUVVVWY]
      [22....................BBCDEEFFHHHIJKKLLMNOOPPQRRRSSSTUVVVWY]
      [2...................AABBCDEEFFHHHIJKKLLMNOOPPQRRRSSSTUVVVWY]
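
      A sketch of that reverse pass (the back-to-front consumption order and the use of the remaining run count's parity to pick the source buffer are my layout assumptions, not the poster's implementation):

      use strict;
      use warnings;

      # Consume run lengths from the back of the index while building the
      # merged data from the back, peeling each run off its source's tail.
      sub reverse_migrate {
          my ($buf0, $buf1, @runs) = @_;   # sorted sources + run lengths
          my @src    = ($buf0, $buf1);
          my $merged = '';
          while (@runs) {
              my $len  = pop @runs;        # shrink the index from the back
              my $from = @runs % 2;        # remaining count gives parity
              $merged  = substr( $src[$from], -$len, $len, '' ) . $merged
                  if $len;                 # 4-arg substr removes the tail
          }
          return $merged;
      }

      # Prints the full 40-character sorted merge, consuming all 16 run
      # lengths (the hand-made trace above strands a '2' -- the "spare
      # parts" mentioned above)
      print reverse_migrate(
          'AABEFFHHKMOOQRRSSTUV',
          'ABCDEHIJKLLNPPRSVVWY',
          2,2,1,3,5,4,1,2,1,1,2,2,3,2,5,4,
      ), "\n";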

      Of course you still have all the same other problems to solve: how do you minimize data motion, where do you put data that is going to be displaced by the eventual write from OOL to Main buffer, and so on.

      I'd have to model it to get a feel for what the real savings would be, if any.

      To be of use, it depends on either an inexpensive bitrun-to-integer conversion or the ability to store and use reasonably-sized tinyint incrementors. I would not expect Perl to be your tool of choice. :-)

      Update: Removed department of redundancy department.

        to merely indicate which source buffer the resulting buffer data

        I think the process of iterating, comparing, and building the bitmap during the pre-scan, plus iterating and decoding the bitmap in the next stage, is more expensive than simply re-iterating and comparing during the next stage.

