Okay, I converted it to write the keys to a file rather than collect them in a hash.

For simplicity I wrote the keys, offsets and lengths as strings; to conserve space on disk (the sort files were about 91MB each) you could probably use pack to get them into fixed-width binary records.
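To make the pack idea concrete, here is a minimal sketch. The record layout (a 4-character key followed by two 32-bit big-endian integers) and the helper names are my assumptions for illustration, not something the scripts below use, and the offsets would have to fit in 32 bits. Note that a line-oriented sort utility would not handle such binary records directly.

```perl
use strict;
use warnings;

# Assumed layout: 4-character key, 32-bit big-endian offset, 32-bit big-endian length.
my $RECORD_FORMAT = 'a4 N N';
my $RECORD_SIZE   = length(pack($RECORD_FORMAT, '', 0, 0));

# Hypothetical helper: turn key/offset/length into one fixed-width record.
sub pack_sort_record {
    my ($key, $offset, $length) = @_;
    return pack($RECORD_FORMAT, $key, $offset, $length);
}

# Hypothetical helper: recover key/offset/length from a fixed-width record.
sub unpack_sort_record {
    my ($record) = @_;
    return unpack($RECORD_FORMAT, $record);
}

# Made-up sample values.
my $record = pack_sort_record('0712', 123_456, 80);
my ($key, $offset, $length) = unpack_sort_record($record);

printf "%d bytes: key=%s offset=%d length=%d\n",
    length($record), $key, $offset, $length;
```

Every record is the same size under this layout, whereas the string form grows with the number of digits in the offset and length.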

In any case, I then split the functions so that pre-sort and post-sort activities are in separate scripts. The sort was then run separately as a command-line utility.
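The exact sort invocation is not shown in this post, so the following is only a guess at what the middle step could look like when driven from Perl. It assumes a POSIX-style sort that accepts -o for the output file (as Cygwin's does); the helper name is hypothetical, and the file names follow the naming used in the scripts below.

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for an external POSIX-style sort.
# A plain lexical sort is enough because the keys are written zero-padded.
sub sort_command {
    my ($presortFile, $sortedFile) = @_;
    return ('sort', '-o', $sortedFile, $presortFile);
}

my $Inpfnm = 'test2.dat';
my @cmd = sort_command("$Inpfnm-presort.dat", "$Inpfnm-sorted.dat");
print "@cmd\n";

# To actually run it between the two scripts:
# system(@cmd) == 0 or die "sort failed: $?";
```

Passing the command as a list to system avoids shell quoting problems if the file names ever contain spaces.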

The results are favorable:

TwoKeySortDisk1.pl (extracts keys/offset/length and writes to file)
    Starting RAM:   2.74GB      Start Time:   05:03:13
    Peak RAM:       2.78GB      Ending Time:  05:03:32
    Run Time:       19 sec
    Peak RAM Usage: 0.04GB * 1024 = 40MB

Sort (probably Cygwin, not Windows native)
    Starting RAM:   2.75GB      Start Time:   05:04:39
    Peak RAM:       3.02GB      Ending Time:  05:04:49
    Run Time:       10 sec
    Peak RAM Usage: 0.27GB * 1024 = 276MB

TwoKeySortDisk2.pl (reads sorted keys, reads original file in random mode, writes output text file)
    Starting RAM:   2.73GB      Start Time:   05:07:56
    Peak RAM:       2.82GB      Ending Time:  05:08:45
    Run Time:       49 sec
    Peak RAM Usage: 0.09GB * 1024 = 92MB

Well, at least the Perl portion remains under 100MB. :-)

Here's the first script, which pulls out the keys/offsets/lengths and writes them for external sorting.

    #!/usr/bin/perl -w
    use strict;

    my $NEWLINE_SIZE = length "\n"; # The size of the newline "character" in this OS
    my $OS_ADJUST    = 1;           # A way to somewhat generically do OS-specific offset computation
    my $KEY_OFFSET   = 'O';         # Optimized key name for offset value
    my $KEY_LENGTH   = 'L';         # Optimized key name for length value
    my $SEEK_SET     = 0;           # In case you don't want to export the constant for seek()

    my $Inpfnm = 'test2.dat';
    my $Wrkfnm = $Inpfnm . '-presort.dat';
    my $Srtfnm = $Inpfnm . '-sorted.dat';
    my $Outfnm = $Inpfnm . '-output.dat';

    {
        &convertKeysAndOffsets();
    }
    exit;

    sub convertKeysAndOffsets
    {
        my $inputOffset = 0;

        open INPUT_FILE, "<$Inpfnm" or die "Cannot open $Inpfnm: $!";
        open PRESORT_FILE, ">$Wrkfnm" or die "Cannot open $Wrkfnm: $!";
        while (my $inputBuffer = <INPUT_FILE>)
        {
            chomp $inputBuffer;
            my $inputLength = length $inputBuffer;

            # Only capture records which match the structure
            if ($inputBuffer =~ /^\s*key(\d+)\s+key(\d+)\s+/)
            {
                # Capture the keys
                my $primaryKey   = $1;
                my $secondaryKey = $2;

                # Optimize the keys
                my $optimizedKey = sprintf "%02d%02d", $primaryKey, $secondaryKey;
                my $sortBuffer   = "$optimizedKey\|$inputOffset\|$inputLength";
                print PRESORT_FILE "$sortBuffer\n";
            }

            # Adjust the offset for read just committed. This is done for every
            # line, matched or not, so that later offsets stay aligned with the file.
            ###########################################################################
            ### WARNING ### Test on small file to ensure you are getting the right
            ### results on your OS
            ###########################################################################
            $inputOffset += $inputLength;
            $inputOffset += $NEWLINE_SIZE;
            $inputOffset += $OS_ADJUST;
        }
        close PRESORT_FILE;
        close INPUT_FILE;
    }
    __END__
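The reason for the zero-padded "%02d%02d" key is that it lets a plain string sort produce the correct numeric order. A small sketch of that, using made-up key values and an in-memory sort standing in for the external sort utility (the helper name is mine):

```perl
use strict;
use warnings;

# Build a sort record in the same key|offset|length form as the script above.
sub make_sort_record {
    my ($primaryKey, $secondaryKey, $inputOffset, $inputLength) = @_;
    my $optimizedKey = sprintf "%02d%02d", $primaryKey, $secondaryKey;
    return "$optimizedKey|$inputOffset|$inputLength";
}

# Made-up sample values: (primary key, secondary key, offset, length).
my @records = map { make_sort_record(@$_) } (
    [10,  2,  0, 40],
    [ 2,  9, 42, 40],
    [ 2, 10, 84, 40],
);

# A plain string sort stands in for the external sort utility here.
my @sorted = sort @records;
print "$_\n" for @sorted;
```

The padded keys come out in the order (2,9), (2,10), (10,2), which matches the numeric order. The two-digit width only holds for keys up to 99; larger keys would need a wider format.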

Here's the second script, which reads the sorted keys/offsets/lengths and performs the actual big data sort as before.

    #!/usr/bin/perl -w
    use strict;

    my $NEWLINE_SIZE = length "\n"; # The size of the newline "character" in this OS
    my $OS_ADJUST    = 1;           # A way to somewhat generically do OS-specific offset computation
    my $KEY_OFFSET   = 'O';         # Optimized key name for offset value
    my $KEY_LENGTH   = 'L';         # Optimized key name for length value
    my $SEEK_SET     = 0;           # In case you don't want to export the constant for seek()

    my $Inpfnm = 'test2.dat';
    my $Wrkfnm = $Inpfnm . '-presort.dat';
    my $Srtfnm = $Inpfnm . '-sorted.dat';
    my $Outfnm = $Inpfnm . '-output.dat';

    {
        &sortFile();
        &cleanUp();
    }
    exit;

    sub sortFile
    {
        # In the old days this would also be known as shakeTheHardDrive()
        open INPUT_FILE, '<', "$Inpfnm" or die "Cannot open $Inpfnm: $!";
        binmode INPUT_FILE;
        open OUTPUT_FILE, ">$Outfnm" or die "Cannot open $Outfnm: $!";
        open SORTED_FILE, "<$Srtfnm" or die "Cannot open $Srtfnm: $!";
        while (my $sortedKeyBuffer = <SORTED_FILE>)
        {
            chomp $sortedKeyBuffer;
            my ($keyValue, $inputOffset, $workingLength) = split /\|/, $sortedKeyBuffer;

            # Jump to the record in the original file and copy it to the output
            seek INPUT_FILE, $inputOffset, $SEEK_SET;
            my $inputBuffer = '';
            my $inputCount = read INPUT_FILE, $inputBuffer, $workingLength;
            print OUTPUT_FILE "$inputBuffer\n";
        }
        close SORTED_FILE;
        close OUTPUT_FILE;
        close INPUT_FILE;
    }

    sub cleanUp
    {
        unlink $Wrkfnm;
        unlink $Srtfnm;
    }
    __END__
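If the OS-specific offset arithmetic in the first script worries you, one alternative (not what the scripts above do) is to read the input in binmode and let tell() report each line's byte offset. A small self-contained sketch of that, followed by the same seek/read retrieval, using a temporary file with made-up records:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Fcntl qw(SEEK_SET);

# Create a small made-up data file to index (sample content only).
my ($tmp, $tmpName) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "key02 key01 second\n", "key01 key05 first\n";
close $tmp or die "Cannot close $tmpName: $!";

# Pass 1: record each line's byte offset with tell() instead of computing it.
open my $in, '<', $tmpName or die "Cannot open $tmpName: $!";
binmode $in;
my @index;
my $offset = tell($in);
while (my $line = <$in>) {
    push @index, [$offset, length($line)];   # length includes the line ending
    $offset = tell($in);
}

# Pass 2: fetch the records in a different order using seek() and read().
my @fetched;
for my $entry (reverse @index) {
    my ($recOffset, $recLength) = @$entry;
    seek($in, $recOffset, SEEK_SET) or die "Cannot seek: $!";
    my $got = read($in, my $buffer, $recLength);
    die "Short read at offset $recOffset"
        unless defined $got && $got == $recLength;
    push @fetched, $buffer;
}
close $in;

print @fetched;
```

Because the stored length here includes the line ending, the records can be written back out as-is, without appending a newline.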

In reply to Re: sorting type question- space problems by marinersk
in thread sorting type question- space problems by baxy77bax
