in reply to Re^8: Sort big text file - byte offset - 50% there (Added code)
in thread Sort big text file - byte offset - 50% there

OK, I hit a snag. I am sorting a big file: 4.42 GB (4,757,131,264 bytes), 13,961,993 lines. The sorted file that gets written is always smaller than the original, and this only happens with these huge files. My thinking is that the offset grows into a number larger than pack 'N' can hold (a 32-bit unsigned integer, maximum 4,294,967,295).

push @index, pack 'NN', $epoch, $offset;
Can someone confirm this and/or let me know if there is another way? Thanks

<edit> Ok, I just figured out that my guess was correct.

#!/usr/bin/perl -w
use strict;

my $num = 4757131264;
print "NUM: ".$num."\n";

my $pnum = pack 'N', $num;
print "NUM: ".unpack( 'N',$pnum)."\n";
How else can I pack the data and maintain the fast sort? My system does not support 64-bit.
</edit>

Re^10: Sort big text file - byte offset
by BrowserUk (Patriarch) on Aug 31, 2006 at 22:46 UTC

    Use 'd'. It will handle integers up to 2**53 without loss of accuracy, which means you're set for files up to ~8,000,000 GB, which should give you room for a little future growth. Be aware that your index will require more memory.
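A minimal sketch of what that change could look like, assuming the index entries are built as in the original `pack 'NN', $epoch, $offset` line. The sample epoch value is made up for illustration, and the 8-byte size of 'd' is the usual case for a native double rather than a guarantee.

```perl
use strict;
use warnings;

# Sample values: the epoch is invented, the offset is the file size from the question.
my ( $epoch, $offset ) = ( 1157064360, 4757131264 );

my $old = pack 'NN', $epoch, 0;          # original layout: 4 + 4 bytes
my $new = pack 'Nd', $epoch, $offset;    # 'd' layout: 4 bytes + one native double

# Unpack again to check that the large offset survives the round trip.
my ( $e, $o ) = unpack 'Nd', $new;

printf "entry size: %d -> %d bytes\n", length( $old ), length( $new );
printf "offset round-trips: %s\n", $o == $offset ? 'yes' : 'no';
```

With 13,961,993 lines, four extra bytes per entry comes to roughly 56 MB more packed data, which is the memory cost mentioned above.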


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^10: Sort big text file - byte offset
by BrowserUk (Patriarch) on Sep 02, 2006 at 15:45 UTC

    It was pointed out that whilst using 'd' to pack your offsets will work, when it comes to doing the sorting you would have to unpack the values and do a numeric comparison, as sorting numeric data in its packed binary form doesn't work for packed floats/doubles.
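The problem can be illustrated with a small sketch. It assumes Perl 5.10 or later so that the 'd<' template (explicitly little-endian double) is available; this makes the demonstration behave the same on every machine, whereas plain 'd' uses the native byte order. The three sample values are chosen only for illustration.

```perl
use 5.010;
use strict;
use warnings;

# 'd<' forces little-endian doubles; plain 'd' would use the native order.
my @offsets = ( 1, 2, 4757131264 );
my @packed  = map { pack 'd<', $_ } @offsets;

# A plain string sort compares bytes left to right, which for little-endian
# doubles means the least significant bytes are compared first.
my @by_bytes = map { unpack 'd<', $_ } sort @packed;

# A numeric sort has to unpack both operands on every comparison.
my @by_value = map { unpack 'd<', $_ }
               sort { unpack( 'd<', $a ) <=> unpack( 'd<', $b ) } @packed;

print "string sort:  @by_bytes\n";
print "numeric sort: @by_value\n";
```

The two lines of output differ: the string sort does not return the values in numeric order, and the numeric sort pays for an unpack on each comparison.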

    Having previously extolled the virtues of sorting numeric data in its binary form, that doesn't sit right, so here are a couple of routines that will pack and unpack an FP value < 2**53 to an 8-byte binary form that is sortable, along with some rudimentary tests:

    #! perl -slw
    use strict;
    use Math::Random::MT qw[ rand ];

    sub ftob64 {
        return pack 'NN', int( $_[ 0 ] / 2**32 ), int( $_[ 0 ] % 2**32 );
    }

    sub b64tof {
        my( $hi, $lo ) = unpack 'NN', $_[ 0 ];
        return $hi * 2**32 + $lo;
    }

    for ( 1 .. 1e6 ) {
        my $test = int( rand 2**53 );
        my $b64 = ftob64 $test;
        my $float = b64tof $b64;
        if( abs( $test - $float ) > 1e-15 ) {
            printf "%31.f v %31.f => diff %31.31f\n", $test, $float, abs( $test - $float );
        }
    }

    my @randomBin = map{ ftob64 int rand 2**53 } 1 .. 1e6;
    my @sortedBin = sort @randomBin;
    my @sortedN = map{ b64tof $_ } @sortedBin;

    $sortedN[ $_ ] > $sortedN[ $_ + 1 ]
        and die "Error: $_ : $sortedN[ $_ ] > $sortedN[ $_ + 1 ]"
        for 0 .. $#sortedN - 1;
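As a rough sketch of how the same hi/lo split might be applied to the epoch-plus-offset index from the original question: the helper names `pack_entry` and `unpack_entry` and the sample records are hypothetical, not part of the posted code. The low half is computed by subtraction here instead of `%`, purely to keep the arithmetic obvious.

```perl
use strict;
use warnings;

# Hypothetical helper: epoch plus the offset split into two 32-bit halves.
sub pack_entry {
    my ( $epoch, $offset ) = @_;
    my $hi = int( $offset / 2**32 );
    my $lo = $offset - $hi * 2**32;
    return pack 'NNN', $epoch, $hi, $lo;
}

# Hypothetical helper: reverse of pack_entry.
sub unpack_entry {
    my ( $epoch, $hi, $lo ) = unpack 'NNN', $_[ 0 ];
    return ( $epoch, $hi * 2**32 + $lo );
}

# Invented sample records: [ epoch, byte offset ].
my @index = map { pack_entry( @$_ ) } (
    [ 1157064360, 4757131264 ],
    [ 1157064300, 12 ],
    [ 1157064360, 4294967296 ],
);

# Big-endian fields mean a plain string sort orders by epoch, then by offset.
for my $entry ( sort @index ) {
    my ( $epoch, $offset ) = unpack_entry( $entry );
    print "$epoch $offset\n";
}
```

Each entry is 12 bytes, the same size as the 'Nd' layout, but the default string sort can still be used without unpacking inside the comparison.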
