Okay, I converted it to write the keys to a file rather than collect them in a hash.

For simplicity I wrote the keys, offsets and lengths as strings; to conserve space on disk (the sort files were about 91MB each) you could probably use pack to store them as fixed-width binary fields.
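
As a rough sketch of that pack idea (not something I ran for the timings in this post), the record layout below is an assumption: a 4-character key followed by the offset and length as 32-bit unsigned integers. The names $RECORD_FORMAT, packRecord and unpackRecord are made up for illustration.

```perl
use strict;
use warnings;

# Assumed layout (not from the original scripts):
#   A4 = 4-character key, space padded
#   N  = offset as a 32-bit big-endian unsigned integer
#   N  = length as a 32-bit big-endian unsigned integer
my $RECORD_FORMAT = 'A4 N N';
my $RECORD_SIZE   = length(pack($RECORD_FORMAT, '', 0, 0));   # 12 bytes per record

# Turn one key/offset/length triple into a fixed-width binary record
sub packRecord
{
     my ($key, $offset, $length) = @_;
     return pack($RECORD_FORMAT, $key, $offset, $length);
}

# Recover the key, offset and length from a fixed-width binary record
sub unpackRecord
{
     my ($record) = @_;
     return unpack($RECORD_FORMAT, $record);
}
```

Two caveats: packed records can contain newline bytes, so a line-oriented sort utility could not be used on them as-is, and a 32-bit offset field only covers files under 4GB.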

In any case, I then split the functions so that the pre-sort and post-sort activities are in separate scripts. The sort was then run separately as a command-line utility.
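
I ran the sort by hand, but if you would rather drive it from Perl, something along these lines should work. This is only a sketch: it assumes a POSIX-style sort (such as the Cygwin one) is the first sort on your PATH, and sortKeyFile is a made-up helper name.

```perl
use strict;
use warnings;

# Hypothetical wrapper around an external POSIX-style sort.
# Assumes "sort -o OUTPUT INPUT" is available; Windows native sort.exe uses different switches.
sub sortKeyFile
{
     my ($presortName, $sortedName) = @_;

     # Force plain byte-order comparison so the zero-padded keys sort predictably
     local $ENV{LC_ALL} = 'C';

     # List form of system() avoids shell quoting problems with file names
     system('sort', '-o', $sortedName, $presortName) == 0
          or die "sort failed with status $?";
     return;
}
```

Because the keys are written zero-padded at the start of each line, a whole-line text sort is enough; no field options are needed.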

The results are favorable:

TwoKeySortDisk1.pl (extracts keys/offset/length and writes to file)
Starting RAM:  2.74GB
  Start Time:  05:03:13
    Peak RAM:  2.78GB
 Ending Time:  05:03:32

      Run Time:  19 sec
Peak RAM Usage:  0.04GB * 1024 = 40MB

Sort (probably Cygwin, not Windows native)
Starting RAM:  2.75GB
  Start Time:  05:04:39
    Peak RAM:  3.02GB
 Ending Time:  05:04:49

      Run Time:  10 sec
Peak RAM Usage:  0.27GB * 1024 = 276MB

TwoKeySortDisk2.pl (reads sorted keys, reads original file in random mode, writes output text file)
Starting RAM:  2.73GB
  Start Time:  05:07:56
    Peak RAM:  2.82GB
 Ending Time:  05:08:45

      Run Time:  49 sec
Peak RAM Usage:  0.09GB * 1024 = 92MB

Well, at least the Perl portion remains under 100MB. :-)

Here's the first script, which pulls out the keys/offsets/lengths and writes them for external sorting.

#!/usr/bin/perl -w

use strict;

my   $NEWLINE_SIZE  = length "\n";      # The size of the newline "character" in this OS
my   $OS_ADJUST     = 1;                # A way to somewhat generically do OS-specific offset computation

my   $KEY_OFFSET    = 'O';              # Optimized key name for offset value
my   $KEY_LENGTH    = 'L';              # Optimized key name for length value

my   $SEEK_SET = 0;                     # In case you don't want to export the constant for seek()

my   $Inpfnm = 'test2.dat';
my   $Wrkfnm = $Inpfnm . '-presort.dat';
my   $Srtfnm = $Inpfnm . '-sorted.dat';
my   $Outfnm = $Inpfnm . '-output.dat';

{
     &convertKeysAndOffsets();
}

exit;

sub convertKeysAndOffsets
{
     my $inputOffset = 0;
     open INPUT_FILE, '<', $Inpfnm or die "Cannot open $Inpfnm: $!";
     open PRESORT_FILE, '>', $Wrkfnm or die "Cannot create $Wrkfnm: $!";
     while (my $inputBuffer = <INPUT_FILE>)
     {
          chomp $inputBuffer;
          # Record size is needed for every line, matched or not
          my $inputLength = length $inputBuffer;
          # Only capture records which match the structure
          if ($inputBuffer =~ /^\s*key(\d+)\s+key(\d+)\s+/)
          {
               # Capture the keys
               my $primaryKey = $1;
               my $secondaryKey = $2;
               # Optimize the keys (zero-padded for text sorting; assumes keys of at most two digits)
               my $optimizedKey = sprintf "%02d%02d", $primaryKey, $secondaryKey;
               my $sortBuffer = "$optimizedKey\|$inputOffset\|$inputLength";
               print PRESORT_FILE "$sortBuffer\n";
          }
          # Adjust the offset for read just committed. This happens for every
          # line, so a skipped record does not throw off the later offsets.
          ############################################################################################
          ### WARNING ### Test on small file to ensure you are getting the right results on your OS #
          ############################################################################################
          $inputOffset += $inputLength;
          $inputOffset += $NEWLINE_SIZE;
          $inputOffset += $OS_ADJUST;
     }
     close PRESORT_FILE or die "Cannot close $Wrkfnm: $!";
     close INPUT_FILE;
}


__END__
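
If the newline arithmetic above makes you nervous, one alternative (a sketch, not what I timed) is to open the file in raw mode and ask tell() where each line starts, so no OS-specific adjustment is needed. indexLines is a made-up name for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [offset, length] pairs, one per line.
# Raw mode means offsets and lengths are in bytes, regardless of OS line endings.
sub indexLines
{
     my ($fileName) = @_;
     open my $fh, '<:raw', $fileName or die "Cannot open $fileName: $!";

     my @index;
     while (1)
     {
          my $offset = tell $fh;          # byte position before this line is read
          my $line = <$fh>;
          last unless defined $line;
          $line =~ s/\r?\n\z//;           # strip LF or CRLF line ending
          push @index, [$offset, length $line];
     }
     close $fh;
     return @index;
}
```

The offsets this produces can be handed straight to seek() on a handle opened in binmode, which is how the second script reads the original file.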

Here's the second script, which reads the sorted keys/offsets/lengths and performs the actual big data sort as before.

#!/usr/bin/perl -w

use strict;

my   $NEWLINE_SIZE  = length "\n";      # The size of the newline "character" in this OS
my   $OS_ADJUST     = 1;                # A way to somewhat generically do OS-specific offset computation

my   $KEY_OFFSET    = 'O';              # Optimized key name for offset value
my   $KEY_LENGTH    = 'L';              # Optimized key name for length value

my   $SEEK_SET = 0;                     # In case you don't want to export the constant for seek()

my   $Inpfnm = 'test2.dat';
my   $Wrkfnm = $Inpfnm . '-presort.dat';
my   $Srtfnm = $Inpfnm . '-sorted.dat';
my   $Outfnm = $Inpfnm . '-output.dat';

{
     &sortFile();
     &cleanUp();
}

exit;

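sub sortFile
{
     # In the old days this would also be known as shakeTheHardDrive()
     open INPUT_FILE, '<', $Inpfnm or die "Cannot open $Inpfnm: $!";
     binmode INPUT_FILE;

     open OUTPUT_FILE, '>', $Outfnm or die "Cannot create $Outfnm: $!";
     open SORTED_FILE, '<', $Srtfnm or die "Cannot open $Srtfnm: $!";

     while (my $sortedKeyBuffer = <SORTED_FILE>)
     {
          chomp $sortedKeyBuffer;
          my ($keyValue, $inputOffset, $workingLength) = split /\|/, $sortedKeyBuffer;

          seek INPUT_FILE, $inputOffset, $SEEK_SET or die "Cannot seek to $inputOffset: $!";
          my $inputBuffer = '';
          my $inputCount = read INPUT_FILE, $inputBuffer, $workingLength;
          # A short read means the offset or length in the key file is wrong
          die "Short read at offset $inputOffset"
               unless defined $inputCount && $inputCount == $workingLength;
          print OUTPUT_FILE "$inputBuffer\n";
     }

     close SORTED_FILE;
     close OUTPUT_FILE or die "Cannot close $Outfnm: $!";
     close INPUT_FILE;
}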

sub cleanUp
{
     unlink $Wrkfnm;
     unlink $Srtfnm;
}

__END__

In reply to Re: sorting type question- space problems by marinersk
in thread sorting type question- space problems by baxy77bax
