in reply to sorting type question- space problems
Okay, I converted it to write the keys to a file rather than collect them in a hash.
For simplicity I wrote the keys, offsets and lengths as strings. To conserve space on disk (the sort files were about 91MB each) you could probably use pack to store them as fixed-width records.
In any case, I then split the work so that the pre-sort and post-sort activities are in separate scripts. The sort itself was run separately as a command-line utility.
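As a rough illustration of the pack idea, here is a minimal sketch. The record layout ('A4 N N': a 4-character key followed by two 32-bit big-endian integers) and the helper names packSortRecord and unpackSortRecord are my own assumptions, not part of the scripts in this post. Note that N limits offsets to files under 4GB, and that a packed binary file is no longer line-oriented, so a text-based sort utility would not handle it without further work.

```perl
use strict;
use warnings;

# Hypothetical fixed-width layout (an assumption for this sketch):
#   A4 = 4-character space-padded key, N = 32-bit big-endian unsigned integer
my $RECORD_TEMPLATE = 'A4 N N';
my $RECORD_SIZE     = length(pack($RECORD_TEMPLATE, '', 0, 0));

# Pack one key/offset/length triple into a fixed-width binary record
sub packSortRecord {
    my ($optimizedKey, $inputOffset, $inputLength) = @_;
    return pack($RECORD_TEMPLATE, $optimizedKey, $inputOffset, $inputLength);
}

# Recover the key, offset and length from a packed record
sub unpackSortRecord {
    my ($record) = @_;
    return unpack($RECORD_TEMPLATE, $record);
}

my $record = packSortRecord('0107', 123456, 80);
my ($key, $offset, $length) = unpackSortRecord($record);
print "$RECORD_SIZE bytes per record: key=$key offset=$offset length=$length\n";
```

Every record here is 12 bytes regardless of how large the offset is, whereas the string form grows with the number of digits.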
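The exact sort command is not shown in this post, so the following is only a sketch of how the external step could be driven from Perl. It assumes a POSIX-style sort (such as the Cygwin one) is first on the PATH and uses its -o option to name the output file. The helper name runExternalSort and the demo file names are made up for the example. Because the optimized key is a fixed-width digit string at the start of each line, a plain byte-order sort of whole lines is enough.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Run the external sort utility on the presort file (assumes a POSIX-style
# "sort" is on the PATH; the Windows native sort takes different options)
sub runExternalSort {
    my ($presortName, $sortedName) = @_;
    local $ENV{LC_ALL} = 'C';    # compare lines as plain bytes
    my $status = system('sort', '-o', $sortedName, $presortName);
    die "sort failed with status $status\n" if $status != 0;
    return;
}

# Tiny demonstration with three key|offset|length lines
my $dir     = tempdir(CLEANUP => 1);
my $presort = "$dir/demo-presort.dat";
my $sorted  = "$dir/demo-sorted.dat";

open my $out, '>', $presort or die "Cannot write $presort: $!";
print {$out} "0201|0|20\n0102|22|20\n0101|44|20\n";
close $out or die "Cannot close $presort: $!";

runExternalSort($presort, $sorted);

open my $in, '<', $sorted or die "Cannot read $sorted: $!";
print while <$in>;
close $in;
```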
The results are favorable:
TwoKeySortDisk1.pl (extracts keys/offset/length and writes to file)
    Starting RAM:   2.74GB    Start Time:  05:03:13
    Peak RAM:       2.78GB    Ending Time: 05:03:32
                              Run Time:    19 sec
    Peak RAM Usage: 0.04GB * 1024 = 40MB

Sort (probably Cygwin, not Windows native)
    Starting RAM:   2.75GB    Start Time:  05:04:39
    Peak RAM:       3.02GB    Ending Time: 05:04:49
                              Run Time:    10 sec
    Peak RAM Usage: 0.27GB * 1024 = 276MB

TwoKeySortDisk2.pl (reads sorted keys, reads original file in random mode, writes output text file)
    Starting RAM:   2.73GB    Start Time:  05:07:56
    Peak RAM:       2.82GB    Ending Time: 05:08:45
                              Run Time:    49 sec
    Peak RAM Usage: 0.09GB * 1024 = 92MB
Well, at least the Perl portion remains under 100MB. :-)
Here's the first script, which pulls out the keys/offsets/lengths and writes them for external sorting.
    #!/usr/bin/perl -w
    use strict;

    my $NEWLINE_SIZE = length "\n";  # The size of the newline "character" in this OS
    my $OS_ADJUST    = 1;            # A way to somewhat generically do OS-specific offset computation
    my $KEY_OFFSET   = 'O';          # Optimized key name for offset value
    my $KEY_LENGTH   = 'L';          # Optimized key name for length value
    my $SEEK_SET     = 0;            # In case you don't want to export the constant for seek()

    my $Inpfnm = 'test2.dat';
    my $Wrkfnm = $Inpfnm . '-presort.dat';
    my $Srtfnm = $Inpfnm . '-sorted.dat';
    my $Outfnm = $Inpfnm . '-output.dat';

    {
        &convertKeysAndOffsets();
    }
    exit;

    sub convertKeysAndOffsets {
        my $inputOffset = 0;
        open INPUT_FILE, '<', $Inpfnm or die "Cannot read $Inpfnm: $!";
        open PRESORT_FILE, '>', $Wrkfnm or die "Cannot write $Wrkfnm: $!";
        while (my $inputBuffer = <INPUT_FILE>) {
            chomp $inputBuffer;
            my $inputLength = length $inputBuffer;

            # Only capture records which match the structure
            if ($inputBuffer =~ /^\s*key(\d+)\s+key(\d+)\s+/) {
                # Capture the keys
                my $primaryKey   = $1;
                my $secondaryKey = $2;

                # Optimize the keys
                my $optimizedKey = sprintf "%02d%02d", $primaryKey, $secondaryKey;
                my $sortBuffer   = "$optimizedKey\|$inputOffset\|$inputLength";
                print PRESORT_FILE "$sortBuffer\n";
            }

            # Adjust the offset for the read just committed. This is done for
            # every line, matching or not, so later offsets stay correct.
            ###################################################################
            ### WARNING ### Test on small file to ensure you are getting    ###
            ###             the right results on your OS                    ###
            ###################################################################
            $inputOffset += $inputLength;
            $inputOffset += $NEWLINE_SIZE;
            $inputOffset += $OS_ADJUST;
        }
        close PRESORT_FILE;
        close INPUT_FILE;
    }
    __END__
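If the hand-computed offset (with $NEWLINE_SIZE and $OS_ADJUST) turns out to be fragile on your OS, one alternative is to ask Perl for the position with tell() on a binmode handle. This is a sketch of that swapped-in technique, not what the script above does. The helper names indexLines and readRecord and the CRLF demo data are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return [offset, length] pairs, one per line, taking the offset from tell()
# instead of computing it from line lengths and newline sizes
sub indexLines {
    my ($fileName) = @_;
    my @index;
    open my $fh, '<', $fileName or die "Cannot read $fileName: $!";
    binmode $fh;                       # byte offsets, no newline translation
    while (1) {
        my $offset = tell $fh;         # position before the line is read
        my $line   = <$fh>;
        last unless defined $line;
        $line =~ s/\r?\n\z//;          # drop LF or CRLF line ending
        push @index, [$offset, length $line];
    }
    close $fh;
    return @index;
}

# Fetch one record by offset and length, as the second script does
sub readRecord {
    my ($fileName, $offset, $length) = @_;
    open my $fh, '<', $fileName or die "Cannot read $fileName: $!";
    binmode $fh;
    seek $fh, $offset, 0 or die "Cannot seek in $fileName: $!";
    my $buffer = '';
    read $fh, $buffer, $length;
    close $fh;
    return $buffer;
}

# Demo file with Windows-style line endings
my ($tmp, $tmpName) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "key02 key01 second\r\n", "key01 key01 first\r\n";
close $tmp or die "Cannot close $tmpName: $!";

for my $entry (indexLines($tmpName)) {
    my ($offset, $length) = @$entry;
    print "$offset,$length: ", readRecord($tmpName, $offset, $length), "\n";
}
```

The trade-off is that reading in binmode leaves any carriage return on the line, so it has to be stripped by hand before matching the keys.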
Here's the second script, which reads the sorted keys/offsets/lengths and performs the actual big data sort as before.
    #!/usr/bin/perl -w
    use strict;

    my $NEWLINE_SIZE = length "\n";  # The size of the newline "character" in this OS
    my $OS_ADJUST    = 1;            # A way to somewhat generically do OS-specific offset computation
    my $KEY_OFFSET   = 'O';          # Optimized key name for offset value
    my $KEY_LENGTH   = 'L';          # Optimized key name for length value
    my $SEEK_SET     = 0;            # In case you don't want to export the constant for seek()

    my $Inpfnm = 'test2.dat';
    my $Wrkfnm = $Inpfnm . '-presort.dat';
    my $Srtfnm = $Inpfnm . '-sorted.dat';
    my $Outfnm = $Inpfnm . '-output.dat';

    {
        &sortFile();
        &cleanUp();
    }
    exit;

    sub sortFile {
        # In the old days this would also be known as shakeTheHardDrive()
        open INPUT_FILE, '<', $Inpfnm or die "Cannot read $Inpfnm: $!";
        binmode INPUT_FILE;
        open OUTPUT_FILE, '>', $Outfnm or die "Cannot write $Outfnm: $!";
        open SORTED_FILE, '<', $Srtfnm or die "Cannot read $Srtfnm: $!";
        while (my $sortedKeyBuffer = <SORTED_FILE>) {
            chomp $sortedKeyBuffer;
            my ($keyValue, $inputOffset, $workingLength) = split /\|/, $sortedKeyBuffer;

            # Jump to the record in the original file and copy it out
            seek INPUT_FILE, $inputOffset, $SEEK_SET;
            my $inputBuffer = '';
            my $inputCount  = read INPUT_FILE, $inputBuffer, $workingLength;
            print OUTPUT_FILE "$inputBuffer\n";
        }
        close SORTED_FILE;
        close OUTPUT_FILE;
        close INPUT_FILE;
    }

    sub cleanUp {
        # Remove the intermediate presort and sorted key files
        unlink $Wrkfnm;
        unlink $Srtfnm;
    }
    __END__