perlquestion
jeroenes
Dear fellow monks,
<p>
I have a design/ tactical problem for you.
<h2>Intro (skippable)</h2>
I'm analysing
some large stretches of data at the moment. I have been
writing some scripts to fetch data from a mere 2 gigabyte
of text-files, writing needed info to a 220 meg binary
file, and chopped that one in two. All sweet, but than
I thought, well, it must be more efficient to make a
sorted file with perl, and access the data via perl
from matlab than to let matlab sort 'm. Just because
the matrix is too large to fit in real memory, and I'm
not allowed to buy more megs for my box.
<p>
<h2>Problem</h2>
So I want to sort some 105 megs of binary data on a 16 bits
integer. Record length is 8 bytes ('SLS'). When I just
read it in, perl dies with 'Out of memory'. Rather strange,
because I only have 13 M items, and 256M physical memory
and 128 M swap. I guess the sort expands the int's to 8 byte
floats, but even that should fit. However, I just
accept that.
<p>
<h2>Solution 1: Changing the perl commands</h2>
Here is the code I use to sort, I hope you can pinpoint
me a problem:
<code>
#!/usr/bin/perl -w
#constants
$record_size = 8;
$amp_index = 6;
$amp_format = 'S';
$time_length = 6;
$sorted = $big=$ARGV[0];
$sorted =~ s/(\.bin)$/_sorted$1/;
#load the data from $big
open DATA, $big;
binmode DATA;
my $amp_a = loadamp();
close DATA;
die "Could not read data from $big" unless @{$amp_a->[0]};
print "Ready with loading\n";
#sort on first item: amp
my @sortamp = sort{ $a->[0] <=> $b->[0] } @$amp_a;
#print out in the 'bigfile' order ([0] contains amp, [1] string with time-info
open OUT, ">$sorted";
binmode OUT;
for $amp( @sortamp ){
print OUT $amp->[1].pack( $amp_format, $amp->[0]);
}
close OUT;
sub loadamp{
my $str;
my @amp_a;
my $n = 1;
do {
$n = sysread( DATA, $str, $record_size);
sysseek( DATA, $record_size, 1);
(my $amp)=unpack( $amp_format, substr( $str, $amp_index, 2) );
push(@amp_a, [$amp,substr( $str, 0, $time_length)]) if $n;
} while ( $n);
\@amp_a;
}
</code>
<h2>Solution 2: Sort in file</h2>
This could be very inefficient, because you will have
to read and write 100 meg chunks many times again.
<h2>Solution 2: Seek in file</h2>
Maybe I could use [Search::binary] to find ints of
a certain value, and repeat that for all values of
int in the file.
<h2>Solution 3: Use known distribution of ints</h2>
I know how by approximation how the ints are distributed.
So I can start a brand new file, fill in the expected
boundaries, and move them as seems fit.
<h2>Solution 4: External module/ (C) library</h2>
Maybe I'd better use some other library to sort my ints.
They could be more memory efficient. But I'm really lost
here. Couldn't find any reference to binary-sorting in
CPAN or at perlmonks.org.
<p>
I hope you can appreciate this problem, and I'm curious
to any idea's, suggestions, etc. Please point me some
nice direction!
<p>
In high regard of you all,
<p>
Jeroen<br>
<i>"We are not alone"(FZ)</i>