http://qs1969.pair.com?node_id=53920

jeroenes has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow monks,

I have a design/ tactical problem for you.

Intro (skippable)

I'm analysing some large stretches of data at the moment. I have been writing scripts to fetch data from a mere 2 gigabytes of text files, writing the needed info to a 220 meg binary file, and chopping that one in two. All sweet, but then I thought: it must be more efficient to make a sorted file with perl, and access the data via perl from matlab, than to let matlab sort it. Just because the matrix is too large to fit in real memory, and I'm not allowed to buy more megs for my box.

Problem

So I want to sort some 105 megs of binary data on a 16-bit integer. Record length is 8 bytes ('SLS'). When I just read it in, perl dies with 'Out of memory'. Rather strange, because I only have 13 M items, and 256 M physical memory plus 128 M swap. I guess the sort expands the ints to 8-byte floats, but even that should fit. However, I just accept that.
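
For reference, a tiny sketch of that record layout (the values are made up): six bytes of time information (an 'S' plus an 'L') followed by the 2-byte amp that the sort runs on.

use strict;
use warnings;

# Made-up values, just to illustrate the 8-byte 'SLS' layout:
# a 2-byte and a 4-byte time field, then the 2-byte amp used as sort key.
my $rec = pack 'SLS', 12, 3_456_789, 42;
printf "record length: %d bytes\n", length $rec;    # 8

my $time = substr $rec, 0, 6;                       # kept verbatim per record
my $amp  = unpack 'S', substr $rec, 6, 2;           # the 16-bit sort key
printf "amp = %d\n", $amp;                          # 42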

Solution 1: Changing the perl commands

Here is the code I use to sort; I hope you can pinpoint the problem:
#!/usr/bin/perl -w

#constants
$record_size = 8;
$amp_index   = 6;
$amp_format  = 'S';
$time_length = 6;
$sorted = $big = $ARGV[0];
$sorted =~ s/(\.bin)$/_sorted$1/;

#load the data from $big
open DATA, $big;
binmode DATA;
my $amp_a = loadamp();
close DATA;
die "Could not read data from $big" unless @{$amp_a->[0]};
print "Ready with loading\n";

#sort on first item: amp
my @sortamp = sort { $a->[0] <=> $b->[0] } @$amp_a;

#print out in the 'bigfile' order ([0] contains amp, [1] string with time-info)
open OUT, ">$sorted";
binmode OUT;
for $amp ( @sortamp ){
    print OUT $amp->[1] . pack( $amp_format, $amp->[0] );
}
close OUT;

sub loadamp{
    my $str;
    my @amp_a;
    my $n = 1;
    do {
        $n = sysread( DATA, $str, $record_size );
        sysseek( DATA, $record_size, 1 );
        (my $amp) = unpack( $amp_format, substr( $str, $amp_index, 2 ) );
        push( @amp_a, [ $amp, substr( $str, 0, $time_length ) ] ) if $n;
    } while ( $n );
    \@amp_a;
}

Solution 2: Sort in file

This could be very inefficient, because you would have to read and write 100 meg chunks many times over.

Solution 3: Seek in file

Maybe I could use Search::Binary to find ints of a certain value, and repeat that for every value of the int in the file.

Solution 4: Use known distribution of ints

I know approximately how the ints are distributed. So I could start a brand new file, fill in the expected boundaries, and move them as seems fit.
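
A counting-sort slant on this idea would need no guessing at all, since a 16-bit key has only 65536 possible values: count each key in a first pass, turn the counts into running offsets, and drop every record into its slot on a second pass. A rough sketch (file names are made up; the key sits at offset 6 as in the script above):

#!/usr/bin/perl -w
use strict;

my ($big, $sorted) = ('big.bin', 'big_sorted.bin');   # made-up names
my $recsize = 8;

# Pass 1: count how often each 16-bit key occurs.
my @count = (0) x 65536;
open my $in, '<', $big or die "Cannot read $big: $!";
binmode $in;
my $rec;
while ( read($in, $rec, $recsize) == $recsize ) {
    $count[ unpack 'S', substr($rec, 6, 2) ]++;
}

# Turn the counts into starting offsets (in records) for each key value.
my @offset = (0) x 65536;
for my $k (1 .. 65535) {
    $offset[$k] = $offset[$k - 1] + $count[$k - 1];
}

# Pass 2: drop every record into its slot in the output file.
open my $out, '+>', $sorted or die "Cannot write $sorted: $!";
binmode $out;
seek($in, 0, 0);
while ( read($in, $rec, $recsize) == $recsize ) {
    seek( $out, $offset[ unpack 'S', substr($rec, 6, 2) ]++ * $recsize, 0 );
    print $out $rec;
}
close $out;
close $in;

The per-record seek on the output file is the weak spot here; buffering records per key value before writing them out would be kinder to the disk.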

Solution 5: External module/ (C) library

Maybe I'd better use some other library to sort my ints. It could be more memory efficient. But I'm really lost here: I couldn't find any reference to binary sorting on CPAN or at perlmonks.org.

I hope you can appreciate this problem, and I'm curious about any ideas, suggestions, etc. Please point me in some nice direction!

In high regard of you all,

Jeroen
"We are not alone"(FZ)

Replies are listed 'Best First'.
Re: Sorting data that don't fit in memory
by dws (Chancellor) on Jan 24, 2001 at 13:45 UTC
    Here's a pointer.

    If you can't fit the data into virtual memory, you're going to have to make a multi-pass sort, with intermediate results stored on disk. One algorithm you might consider is the "Radix" sort, which is a multipass, order-preserving sort that is well suited to very large data sets with fixed-size keys, and has O(n) behavior.

    See Knuth, Vol. 3, Sorting and Searching, for a complete description. As you read Knuth, consider how you'd use disk files for the intermediate partitions.
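
    A rough sketch of how the radix idea could look on disk for these 8-byte records: two distribution passes, one per key byte, into 256 bucket files each (file names are made up, and the key is assumed to sit little-endian at offset 6):

    #!/usr/bin/perl -w
    use strict;

    my $recsize = 8;

    # Distribute the records from @sources into 256 bucket files named
    # "$prefix.000" .. "$prefix.255", bucketing on byte $byte of the key
    # (byte 0 = least significant on a little-endian record).
    sub distribute {
        my ($prefix, $byte, @sources) = @_;
        my @bucket;
        for my $b (0 .. 255) {
            open $bucket[$b], '>', sprintf('%s.%03d', $prefix, $b)
                or die "Cannot write bucket $b: $!";
            binmode $bucket[$b];
        }
        for my $file (@sources) {
            open my $in, '<', $file or die "Cannot read $file: $!";
            binmode $in;
            my $rec;
            while ( read($in, $rec, $recsize) == $recsize ) {
                print { $bucket[ ord substr($rec, 6 + $byte, 1) ] } $rec;
            }
            close $in;
        }
        close $_ for @bucket;
        return map { sprintf('%s.%03d', $prefix, $_) } 0 .. 255;
    }

    # Pass 1 on the low key byte, pass 2 on the high key byte; reading the
    # pass-1 buckets in order keeps the distribution stable, so pass 2
    # leaves the buckets fully sorted.
    my @pass1 = distribute('radix_lo', 0, 'big.bin');
    my @pass2 = distribute('radix_hi', 1, @pass1);

    # Concatenate the final buckets in order (intermediate files can be
    # unlinked as they are consumed).
    open my $out, '>', 'big_sorted.bin' or die "Cannot write output: $!";
    binmode $out;
    for my $file (@pass2) {
        open my $in, '<', $file or die "Cannot read $file: $!";
        binmode $in;
        my $rec;
        print $out $rec while read($in, $rec, $recsize) == $recsize;
        close $in;
    }
    close $out;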

Re: Sorting data that don't fit in memory
by clemburg (Curate) on Jan 24, 2001 at 14:31 UTC

    Mastering Algorithms with Perl discusses your problem on page 150 (External Sorting) and recommends mergesort as the "easiest solution". Mergesort in Perl is described on pages 134 and 135. If I recall correctly, there is also a discussion of your problem in Jon Bentley's Programming Pearls. A third source of guidance that is still less extensive than Knuth is Introduction to Algorithms by Cormen, Leiserson, and Rivest (my favourite, contains excellent pseudocode and very good discussions).

    Christian Lemburg
    Brainbench MVP for Perl
    http://www.brainbench.com

Re (tilly) 1: Sorting data that don't fit in memory
by tilly (Archbishop) on Jan 24, 2001 at 17:30 UTC
    For sorting more data than you have room for, I recommend tying a hash through DB_File to a BTree. That allows you to just put items into the hash and then pull them out in sorted order.

    There is some overhead to the structure, so without large file support this will usually only suffice for a little over a GB of data. With large file support and sufficient disk, you can handle many terabytes efficiently.
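
    A minimal sketch of that setup with DB_File's BTree interface (file names are made up; duplicate keys are enabled because many records share the same 16-bit value, and the key is packed big-endian so the BTree's bytewise ordering agrees with numeric order):

    #!/usr/bin/perl -w
    use strict;
    use DB_File;
    use Fcntl;

    # File names are made up for illustration.
    my ($big, $sorted, $db_file) = ('big.bin', 'big_sorted.bin', 'amp.btree');

    # Many records share the same amp value, so allow duplicate keys.
    $DB_BTREE->{'flags'} = R_DUP;

    my %h;
    my $db = tie %h, 'DB_File', $db_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
        or die "Cannot tie $db_file: $!";

    open my $in, '<', $big or die "Cannot read $big: $!";
    binmode $in;
    my $rec;
    while ( read($in, $rec, 8) == 8 ) {
        my $amp = unpack 'S', substr($rec, 6, 2);
        # BTree keys sort bytewise, so pack the key big-endian ('n') to make
        # byte order agree with numeric order.
        $db->put( pack('n', $amp), substr($rec, 0, 6) );
    }
    close $in;

    # Walking the tree with seq() visits the keys in sorted order,
    # including duplicates.
    open my $out, '>', $sorted or die "Cannot write $sorted: $!";
    binmode $out;
    my ($key, $val) = ('', '');
    for ( my $st = $db->seq($key, $val, R_FIRST);
          $st == 0;
          $st = $db->seq($key, $val, R_NEXT) ) {
        print $out $val, pack('S', unpack 'n', $key);
    }
    close $out;
    undef $db;
    untie %h;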

      ++ for every answer in this thread. I'm very pleased with your responses. But there can only be one... and this time it's tilly. I've been busy all afternoon reading the replies, reading docs and installing Berkeley-related stuff, but now I've run the first few tests after a pretty short time writing. I like BerkeleyDB so much that I've decided to write most of my logic in perl. Smile!

      The script runs pretty fast for small amounts of data, up to 100k items was no problem. Now I'm waiting for the 1M test as I write. Will keep you posted here.

      Thanks a lot,

      Jeroen
      "We are not alone"(FZ)

      That technique looks very interesting. Where can I find more information about how it works (independent of Perl, just the algorithm)?
        Please note that when I wrote that I was more conversant with theory than practice. Since then I've gained some practical experience. (Including sorting a dataset that was too big to store on the disk in uncompressed form!)

        The suggestion will work. It is just the wrong solution to use with a large dataset.

        Let me explain the problem. Suppose we have, say, 100 million lines to sort. Lines average 70 bytes each. We're doing this on a hard drive which turns at 6000 rpm and which can sustain 100 MB/second. How long does the btree solution take?

        Well what is disk seek time? 6000 rpm means 100 revolutions per second. When you choose to seek, the disk could be anywhere, so on average you have half a revolution. So average seek time is 1/200th of a second. We have (ballpark) 100 million writes to make, so 100 million seeks will take us about 500,000 seconds, which is about 5.8 days. (I've left out a number of complicating factors, but still it will take a matter of days to do it.)

        Now let's consider sorting on disk using a merge sort. Merge sort is the process of repeatedly taking sorted lists and merging them into longer sorted lists. If you're clever about it (any good merge sort implementation will be), you can arrange to always be merging lists of about the same size, and you can arrange to not have very many lists lying around at once.

        Our dataset is 7 GB (100 million lines times 70 bytes per line). With 100 million elements, a merge sort will finish in 27 passes. (There is a power-of-2 pattern, since with every pass the length of the lists doubles.) On each pass you have to read and write the data. So that is 7 GB * 27 * 2 = 378 GB. At 100 MB/s that takes 3780 seconds, or about an hour. (Again I've massively oversimplified, but that is about right.)

        Sure, both solutions finish, but which solution would you prefer to run?

        This is not to say that btrees don't have their place. They do. Particularly if you need to keep a changing dataset in sorted order. But if you want to sort a large dataset one time, they are not optimal. And, in fact, if you want to put a large dataset into a btree, you should first sort it with a mergesort then put it in the btree.

        The underlying data structure is a B-tree.
Re: Sorting data that don't fit in memory
by eg (Friar) on Jan 24, 2001 at 15:28 UTC

    I'm no expert on internals, but I'm pretty sure you could save at least half of your memory by changing @amp_a from an array of anonymous arrays (which has an RV, an AV and 2 SVs per element) to a straight array of scalars (only one SV per element).

    Then, instead of carrying around your cumbersome, unpacked data, you can just unpack in the sort:

    my @sortamp = sort { unpack(..., substr(..., $a)) <=> unpack(..., substr(..., $b)) } @$amp_a;

    It might be slow, but (hopefully) you won't run out of memory.

    You don't mention what platform you're running on, but if it's little-endian, this ought to work (because the sort key is an integer at the end of your data):

    my @sortamp = sort{ reverse($a) cmp reverse($b) } @$amp_a;

    (A straight string compare with the most significant byte in front.)
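
    A quick self-contained check of that trick on a little-endian box (the records are made up, with the non-key bytes zeroed so they don't affect the order):

    use strict;
    use warnings;

    # reverse() puts the key's most significant byte first, so a plain
    # string compare orders the records by the trailing 16-bit integer.
    my @recs   = map { pack 'SLS', 0, 0, $_ } 513, 2, 300;
    my @sorted = sort { reverse($a) cmp reverse($b) } @recs;
    print join( ', ', map { unpack 'S', substr $_, 6, 2 } @sorted ), "\n";
    # prints: 2, 300, 513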

      While I'm looking into the possibilities provided by the other posters, these tips don't require more advanced knowledge than I have ;-).

      Thank you for warning me about the anonymous array overhead. Will change that immediately.

      Furthermore, I'm on little endian, so the string comparison should work just fine. However, I don't want to sort on the remainder of the string as it is now. Will run a test first, and post the results here.

      Thanks a lot,

      Jeroen
      "We are not alone"(FZ)

      Update: It must have made some difference in memory usage, but it's still way too much. I made a rough calculation of the memory usage with the system monitor (gnome) and this little script:

      #!/usr/bin/perl
      my $size = 1E6;
      my $f = 2;
      print "Increasing memory print each step by a factor $f\n\n";
      while(1){
          print "\tCreating array of $size items...Press enter to continue....";
          my $b = <STDIN>;
          my @a = 1..$size;
          print "...done. Press enter to continue.";
          $b = <STDIN>;
          chomp $b;
          $b =~ /[qxQX]/ and exit;
          $size *= $f;
      }
      It's about 44 megabytes for every million items in an array. There's no way my 12 M records are going to fit in physical memory. I'm back to radix sort and the like.

        See numbers OK; Re: sorting comma separated value file for an example of how to efficiently sort (in terms of memory use and CPU time). You use two arrays, not an array of tiny arrays. Though I don't think you'll end up using this. (:

        However, I don't want to sort on the remainder of the string as it is now.

        The remainder of the string only enters into it if the first part is the same. Without sorting on the remainder of the string, the order for records with the same 16-bit integer will be "random". So you lose nothing by sorting on that extra data (other than the time involved in comparing those few extra bytes, which seems a net win considering the memory that can be saved).

        But you can save a ton of memory by not storing any of your records in memory as follows:

        my $maxrecs  = 8*1024*1024;   # Whatever you determine fits
        my $recsize  = 8;
        my $sortsize = 2;
        my $sortoff  = 6;
        # Here is the only memory hog:
        my $sorton = " " x ($maxrecs*$sortsize);
        my $idx = 0;
        # Note that I don't use sysread() here as I think the
        # buffering offered by read() may improve speed:
        while( $idx < $maxrecs && read(FILE,$rec,$recsize) ) {
            substr( $sorton, $idx++*$sortsize, $sortsize ) =
                substr( $rec, $sortoff, $sortsize );
        }
        my @idx = sort {
            substr($sorton,$a*$sortsize,$sortsize)
                cmp substr($sorton,$b*$sortsize,$sortsize)
        } 0..($idx-1);
        for $idx ( @idx ) {
            seek( FILE, $idx*$recsize, 0 );
            sysread( FILE, $rec, $recsize );
            print OUT $rec;   # or substr($rec,0,6)
        }

        Personally, I'd just figure out how many records you can sort using this modification, sort that many, and write the sorted list out. Repeat with a new output file until you have, oh, 64 output files or no more data. Then merge the 64 output files into one. Repeat until you have 64 merged files or no more data. Then merge the merged files.

        For merging I'd use a heap (an efficient way of inserting new items into a partially sorted list such that you can efficiently always pull out the "first" item from the list).

        Let me know if you need more details but I suspect the several references already mentioned should cover this.
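
        A minimal sketch of that merge stage, assuming the chunk files already hold 8-byte records sorted on the 2-byte key at offset 6 (the chunk_*.bin and merged.bin names are made up); the heap holds one pending record per chunk, keyed on the unpacked integer:

        #!/usr/bin/perl -w
        use strict;

        my @chunks  = glob 'chunk_*.bin';   # the sorted chunk files
        my $recsize = 8;

        # A tiny binary min-heap of [key, filehandle, record] triples,
        # ordered on the unpacked 16-bit key.
        my @heap;

        sub heap_push {
            my ($item) = @_;
            push @heap, $item;
            my $i = $#heap;
            while ( $i > 0 ) {
                my $p = int( ($i - 1) / 2 );
                last if $heap[$p][0] <= $heap[$i][0];
                @heap[$p, $i] = @heap[$i, $p];
                $i = $p;
            }
        }

        sub heap_pop {
            my $top  = $heap[0];
            my $last = pop @heap;
            if (@heap) {
                $heap[0] = $last;
                my $i = 0;
                while (1) {
                    my ($l, $r, $s) = (2*$i + 1, 2*$i + 2, $i);
                    $s = $l if $l <= $#heap && $heap[$l][0] < $heap[$s][0];
                    $s = $r if $r <= $#heap && $heap[$r][0] < $heap[$s][0];
                    last if $s == $i;
                    @heap[$s, $i] = @heap[$i, $s];
                    $i = $s;
                }
            }
            return $top;
        }

        # Read one record from a chunk, or return undef at end of file.
        sub read_rec {
            my ($fh) = @_;
            my $rec;
            return unless read( $fh, $rec, $recsize ) == $recsize;
            return [ unpack('S', substr($rec, 6, 2)), $fh, $rec ];
        }

        # Prime the heap with the first record of every chunk.
        for my $file (@chunks) {
            open my $fh, '<', $file or die "Cannot read $file: $!";
            binmode $fh;
            my $item = read_rec($fh);
            heap_push($item) if $item;
        }

        # Repeatedly emit the smallest pending record and refill from its chunk.
        open my $out, '>', 'merged.bin' or die "Cannot write merged.bin: $!";
        binmode $out;
        while (@heap) {
            my $item = heap_pop();
            print $out $item->[2];
            my $next = read_rec( $item->[1] );
            heap_push($next) if $next;
        }
        close $out;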

                - tye (but my friends call me "Tye")
Brute force (Re: Sorting data that don't fit in memory)
by dws (Chancellor) on Jan 24, 2001 at 22:41 UTC
    If you're in a hurry to get something working and will have time to work on a faster method later, consider a brute-force approach: sort only what you can fit in memory, then merge the resulting sorted files.
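
    A short sketch of that first stage, under the same 8-byte record layout as above (the chunk size and file names are made up); the resulting chunk files can then be merged pairwise, or all at once with a heap as sketched earlier in the thread:

    #!/usr/bin/perl -w
    use strict;

    my $recsize  = 8;
    my $per_pass = 4_000_000;   # tune to what really fits in memory

    open my $in, '<', 'big.bin' or die "Cannot read big.bin: $!";
    binmode $in;

    my $chunk = 0;
    while (1) {
        # Read the next batch of records.
        my (@recs, $rec);
        while ( @recs < $per_pass && read($in, $rec, $recsize) == $recsize ) {
            push @recs, $rec;
        }
        last unless @recs;

        # Sort the batch on the 2-byte key at offset 6 and write it out.
        my $name = sprintf 'chunk_%03d.bin', $chunk++;
        open my $out, '>', $name or die "Cannot write $name: $!";
        binmode $out;
        print {$out} sort {
            unpack('S', substr($a, 6, 2)) <=> unpack('S', substr($b, 6, 2))
        } @recs;
        close $out;
    }
    close $in;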

    "It's not the Zen way, but it gets the job done." -- Garrison Keiler

Re: Sorting data that don't fit in memory
by dragonchild (Archbishop) on Oct 06, 2007 at 18:33 UTC
    As an alternative to tilly's suggestion, consider using DBM::Deep. One of its design requirements was to provide access to more data than could fit in memory. I'm not sure about the performance comparisons, but if you happen to be dealing with multi-level data structures (HoAoHoAoA or something similar), nothing is going to beat DBM::Deep. In addition, DBM::Deep is (I believe) the only DBM that correctly handles Perl references. That may also be a consideration.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      So, does the act of putting a hash into DBM::Deep cause the keys to be sorted? (That would make sense for an efficient on-disk format and would be a very welcome feature to me.) Or are you suggesting that using perl's sort on an array from DBM::Deep would efficiently swap parts in and out of memory, so that the sorting would run reasonably fast without requiring that much memory? (Update: I guess the latter would also require storing the sorted results into a DBM::Deep array, perhaps the same one.)

      - tye        

        The hash keys are not stored sorted, no. But in doing a sort (and I could be wrong), Perl doesn't necessarily pull all the data into memory. If it does, then I would certainly accept help in getting Perl to do the sort in the way you're describing.

        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Sorting data that don't fit in memory
by roboticus (Chancellor) on Oct 12, 2007 at 11:37 UTC
    jeroenes:

    You might consider doing the work with an external sort program. On Unix / Cygwin you can use sort. You don't want to hand it a binary file at the start, but you're starting with text files anyway.
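
    For instance, driving GNU sort from Perl on the original text files might look like this (the field number, buffer size, and file names are all made up):

    # sort(1) does its own chunking and merging on disk, so memory is not an issue.
    my @cmd = ( 'sort',
                '-k3,3n',           # numeric sort on (say) field 3
                '-S', '200M',       # in-memory buffer before spilling to temp files
                '-T', '/var/tmp',   # where the temporary merge files go
                '-o', 'data_sorted.txt',
                'data.txt' );
    system(@cmd) == 0 or die "sort failed: $?";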

    ...roboticus

Re: Sorting data that don't fit in memory
by andreas1234567 (Vicar) on Oct 12, 2007 at 12:50 UTC
    I second the suggested database approach. Using your favourite database (BerkeleyDB, SQLite, MySQL, PostgreSQL, Oracle, IBM UDB, etc.), apply indices to the sort columns, and do a SELECT ... INTO OUTFILE, dump, or the equivalent.

    I work with MySQL databases with single data files exceeding 0.5 GB every day, where indexed searches are more than fast enough for my purposes. Dumping whole tables of those sizes requires significant I/O, so be patient.
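
    For example, with DBI and SQLite the whole round trip might look roughly like this (table and file names are made up; the same idea carries over to MySQL and its SELECT ... INTO OUTFILE):

    #!/usr/bin/perl -w
    use strict;
    use DBI qw(:sql_types);

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=amp.sqlite', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );
    $dbh->do('CREATE TABLE IF NOT EXISTS rec (amp INTEGER, t BLOB)');

    # Load the 8-byte records: 6 bytes of time info, 2-byte amp at offset 6.
    my $ins = $dbh->prepare('INSERT INTO rec (amp, t) VALUES (?, ?)');
    open my $in, '<', 'big.bin' or die "Cannot read big.bin: $!";
    binmode $in;
    my $rec;
    while ( read($in, $rec, 8) == 8 ) {
        $ins->bind_param( 1, unpack 'S', substr($rec, 6, 2) );
        $ins->bind_param( 2, substr($rec, 0, 6), SQL_BLOB );
        $ins->execute;
    }
    close $in;
    $dbh->do('CREATE INDEX IF NOT EXISTS rec_amp ON rec (amp)');
    $dbh->commit;

    # Read the rows back in key order and rebuild the binary records.
    open my $out, '>', 'big_sorted.bin' or die "Cannot write output: $!";
    binmode $out;
    my $sth = $dbh->prepare('SELECT amp, t FROM rec ORDER BY amp');
    $sth->execute;
    while ( my ($amp, $t) = $sth->fetchrow_array ) {
        print $out $t, pack('S', $amp);
    }
    close $out;
    $dbh->disconnect;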

    --
    Andreas