comment on

Okay. This won't quite fit in 100MB but it's pretty close (~120MB 160MB +{general Perl overhead}) unless Perl is a lot less efficient on its hash storage than I suspect.

Update: Correction: The sort on the hash keys will add another array of about 40MB, so we're up to ~160MB plus general overhead. Grrr. If you're really constrained to 100MB, you're gonna' have to go with Random_Walk's iterative crawl. The trade-off is space vs. time. Typical issue. You get to make the call.

The technique I employed (I believe I saw others suggest it but not sure) was to scan the file once in order to collect keys, offsets, and lengths of the original lines. Then a second pass reads them in order from the original file in binary mode (we used to call this Random IO), and writes them sequentially to the output file. As a side benefit of the approach, the original sequence of full matching keysets is preserved.

Some potential issues with this approach:

For 20GB file, requires 64-bit integers. Dunno how to do that, hope you do.
Commensurate with 32-bit issue, dunno if your Perl can handle files > 2GB
The keys in your first and second examples are different; first example shows key1-key43, but the second example shows key01-key43. So long as key1 and key01 are considered synonynous, then this deeper optimization will work fully.
If the keys are not actually keynn, skip the deep optimization, store the full keys, consider making a single key by combining the two keys with something unique like a pipe separator to gain some slight optimization (e.g. my $optimizedKey = "$primaryKey\|$secondaryKey";).

There are other considerations but you sound sharp enough to catch them.

To try to cram this stuff into as small a space on PerlMonks as possible, I left out all the error checking, so bad opens and bad seeks will send you flying into the pasture without a helmet. 'nuff said.

Without further adieu:

#!/usr/bin/perl -w

use strict;

my   $NEWLINE_SIZE  = length "\n";      # The size of the newline "cha
+racter" in this OS
my   $OS_ADJUST          = 1;                # A way to somewhat gener
+ically do OS-specific offset computation

my   $KEY_OFFSET    = 'O';                        # Optimized key name
+ for offset value
my   $KEY_LENGTH    = 'L';                        # Optimized key name
+ for length value

my   $SEEK_SET = 0;                     # In case you don't want to ex
+port the constant for seek()

my   %SeekInfo = ();

{
     &loadKeysAndOffsets();
     &sortFile();
}

exit;

sub loadKeysAndOffsets
{
     my $inputOffset = 0;
     open INPUT_FILE, "<test1.dat";
     while (my $inputBuffer = <INPUT_FILE>)
     {
          chomp $inputBuffer;
          # Only capture records which match the structure
          if ($inputBuffer =~ /^\s*key(\d+)\s+key(\d+)\s+/)
          {
               # Capture the keys and record size
               my $primaryKey = $1;
               my $secondaryKey = $2;
               my $inputLength = length $inputBuffer;
               # Optimize the keys
               my $optimizedKey = sprintf "%02d%02d", $primaryKey, $se
+condaryKey;
               # Store the offset in a HoA (covers possibility of dupl
+icate key pairs)
               push @{$SeekInfo{$optimizedKey}{$KEY_OFFSET}}, $inputOf
+fset;
               push @{$SeekInfo{$optimizedKey}{$KEY_LENGTH}}, $inputLe
+ngth;
               # Adjust the offset for read just committed.
               #######################################################
+####################################
               ### WARNING ### Test on small file to ensure you are ge
+tting the right results on your OS #
               #######################################################
+####################################
               $inputOffset += $inputLength;
               $inputOffset += $NEWLINE_SIZE;
               $inputOffset += $OS_ADJUST;
          }
     }
     close INPUT_FILE;
}

sub sortFile
{
     # In the old days this would also be known as shakeTheHardDrive()
     open INPUT_FILE, '<', 'test1.dat';
     binmode INPUT_FILE;

     open OUTPUT_FILE, ">test1.dat-output.dat";

     foreach my $keyValue (sort keys %SeekInfo)
     {
          my @workingLength = ();
          push @workingLength, @{$SeekInfo{$keyValue}{$KEY_LENGTH}};
          foreach my $inputOffset (@{$SeekInfo{$keyValue}{$KEY_OFFSET}
+})
          {
               my $workingLength = shift @workingLength;
               seek INPUT_FILE, $inputOffset, $SEEK_SET;
               my $inputBuffer = '';
               my $inputCount = read INPUT_FILE, $inputBuffer, $workin
+gLength;
               print OUTPUT_FILE "$inputBuffer\n";
          }
     }

     close OUTPUT_FILE;
     close INPUT_FILE;
}

__END__


5M lines
20 GB file = 20 * 1024 * 1024 * 1024 = 21,474,836,480
To store offsets into this file you will need more than 2GB:  Cannot u
+se 4-byte integer
So we need to go with 8-byte integers (hope your Perl supports this)

5M lines x 8 bytes per offset slot = 40,000,000 which is ~40MB RAM so 
+far.
Double it to store record size as well; up to ~80MB RAM
     -- Here we could probably use 4-byte integers
     -- But can you do both in the same script?
Triple it to store optimized hash keys puts us up to 120MB RAM.
Perl extras will likely be minimal.
So, not quite fitting in 100MB but really close.
     -- Can squeeze it in if 32-bit and 64-bit integers can cohabitate
+ (??)
[download]

Results:

C:\Steve\Dev\PerlMonks\P-2013-09-16@0043-TwoKeySort>perl TwoKeySort.pl

C:\Steve\Dev\PerlMonks\P-2013-09-16@0043-TwoKeySort>type test1.dat
key1  key2  ndnjfgdsjfjjkjjfjf...
key1  key2  kdfkjdfgdfugbjndkfgkjgndkjfjkd
key43 key21 sdkjfhdghdbgbd
key1  key3  jujdejnsduhffnjj
key2  key2  jhzezhdjjf...

C:\Steve\Dev\PerlMonks\P-2013-09-16@0043-TwoKeySort>type test1.dat-out
+put.dat
key1  key2  ndnjfgdsjfjjkjjfjf...
key1  key2  kdfkjdfgdfugbjndkfgkjgndkjfjkd
key1  key3  jujdejnsduhffnjj
key2  key2  jhzezhdjjf...
key43 key21 sdkjfhdghdbgbd

C:\Steve\Dev\PerlMonks\P-2013-09-16@0043-TwoKeySort>
[download]

In reply to Re: sorting type question- space problems by marinersk
in thread sorting type question- space problems by baxy77bax

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.