Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to sort a very large text file where the sort key is composed of multiple columns, some strings and some numerics. If I use Unix sort, the file sorts in about 10 minutes without hogging memory. However, in Unix I cannot configure the input record separator, so if the data comes with embedded \n characters in the middle then this breaks the sort. So I wrote a Perl sort based on a hash tied to a btree (the DB_File CPAN module), localizing $/ to two characters: ASCII chr(164) followed by \n. I reproduced the unsorted file with these two characters at the end of each line. This works, but the sort takes about 40 minutes on a Linux machine with 8 CPUs (4 CPUs with HyperThreading) and 15GB of memory. During that interval memory is eventually hogged completely, so that other processes fail to start until the sort is done. Is there an equivalent of $/ in Unix? Or is there a Perl sort that is as efficient as the Unix sort? Here is the sorting code, just in case you are wondering how I am doing this:
sub btree{
    my ($disk,$workdir,$outFileName,$sep,$keyMap_aref,$verbose,
        $infile_aref,$indir,$outdir,$splitSize,$keyPos_aref,$outCol_aref)=@_;
    my @keyMap=@{$keyMap_aref};
    my @infile=@{$infile_aref};
    my $outfile=$outdir.'/'.$outFileName;
    my @keyPos=@{$keyPos_aref};
    my @outCol=defined $outCol_aref ? @{$outCol_aref} : ();
    my $idx;
    my %h;

    # generate a ref to an anonymous sub that parses the key out of a data line
    my $parse;
    my $list=join ',',@keyPos;
    my $pcode='$parse=sub {my($line)=@_;join(\''."$sep".'\',((split(/'."$sep".'/,$line,'."$splitSize".'))['."$list".']));};';
    print "pcode: $pcode\n" if $verbose;
    eval $pcode;

    if($disk){
        $idx=$workdir.'/'.$outFileName.'.idx';
        if(-f $idx){
            unlink $idx or croak "cannot unlink $idx: $!\n";
        }
    }

    # create the btree object with a custom comparison routine
    my $t='$DB_BTREE->{\'compare\'} = '.genComp($sep,\@keyMap);
    print "sort criteria: $t\n" if $verbose;
    eval $t;
    my $mybtree;
    if($disk){
        $mybtree=tie %h, "DB_File", $idx, O_RDWR|O_CREAT, 0666, $DB_BTREE
            or croak "Cannot open file $idx: $!\n";
    }
    else{
        $mybtree=tie %h, "DB_File", undef, O_RDWR|O_CREAT, 0666, $DB_BTREE;
    }

    # let's do it: fill up the btree object, one key/value pair per record
    my $bench0=new Benchmark;
    unless(defined $outCol_aref){
        foreach my $infile(@infile){
            print "infile=$infile\n" if $verbose;
            my $fh=new IO::File $indir.'/'.$infile, "r"
                or croak "Cannot open file $infile: $!\n";
            while(not $fh->eof){
                my $line=<$fh>;
                my $key=$parse->($line);
                $h{$key}=$line;
            }
            $fh->close();
        }
    }
    else{
        foreach my $infile(@infile){
            print "infile=$infile\n" if $verbose;
            my $fh=new IO::File $indir.'/'.$infile, "r"
                or croak "Cannot open file $infile: $!\n";
            while(not $fh->eof){
                my $line=<$fh>;
                my $key=$parse->($line);
                $h{$key}=join $sep,((split(/$sep/,$line,$splitSize))[@outCol]);
            }
            $fh->close();
        }
    }

    # cycle through the keys, printing them in order
    my $fh=new IO::File $outfile, "w";
    my $bench1=new Benchmark;
    my($key,$value);
    for(my $status=$mybtree->seq($key,$value,R_FIRST);
        $status==0;
        $status=$mybtree->seq($key,$value,R_NEXT)){
        chomp $value;
        print $fh $value.$/;
    }
    my $bench2=new Benchmark;

    # done
    $fh->close;
    untie %h;

    my $diff=timediff($bench1,$bench0);
    print "sort ".timestr($diff)."\n" if $verbose;
    $diff=timediff($bench2,$bench1);
    print "write ".timestr($diff)."\n" if $verbose;
    $diff=timediff($bench2,$bench0);
    print "total ".timestr($diff)."\n" if $verbose;
}

sub genComp{
    my ($sep,$keyMap_ref)=@_;
    my @keyMap=@{$keyMap_ref};
    my $code='sub {';
    $code.='my($k1,$k2)=@_; my @k1=split /'."$sep".'/,$k1; my @k2=split /'."$sep".'/,$k2;';
    for(my $i=0; $i<=$#keyMap; $i++){
        $code.='"$k1['.$i.']" ';
        $code.=$keyMap[$i] eq 'C' ? 'cmp' : '<=>';   # 'C' marks a string column
        $code.=' "$k2['.$i.']" ';
        $code.=' || ' if $i!=$#keyMap;
    }
    $code.=';';
    $code.='}';
}


Re: perl sort versus Unix sort
by graff (Chancellor) on Jan 26, 2004 at 01:25 UTC
    You said:
    in Unix I cannot configure the input record separator, so if the data comes with embedded \n characters in the middle then this breaks the sort.

    Is it essential that you preserve these "embedded \n" characters? Whatever the answer, if you can get the unix sort process to create exactly the ordering you want, you could use Perl just to "normalize" the records so that they are all single-line and well-behaved when they go through the unix sort. (But unless you're using GNU sort, you might still have a problem if some of the records end up being too long -- I think some flavors of unix sort may still have a limit of 1024 bytes per line.) (update: For some reason I feared that Solaris might be one such limited flavor, but I was wrong -- I could pump lines of >8200 bytes through /usr/bin/sort with no loss of data. Still, if you're not on Linux or Solaris 8 or better, test it first.)

    You are already handling record-based input by setting $/ in perl, so why not try a pipeline like this (I'm not sure if your reference to "164" was a decimal or octal value -- best to use hex and not worry about this ambiguity; I'll guess that you meant decimal):

    perl -pe 'BEGIN {$/="\xa4\n"} s/(?<!\xa4)\n/ /g' | sort ...
    Or, if you want to preserve these "extra" line-feeds, replace them with some character or string that doesn't naturally occur in the data; then, after the sort is done, do another one-liner to re-convert these back to "\n".
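    For example, the full round trip might look something like this (a sketch only: \x01 is just a guess at a byte that never occurs in your data, octal \244 is decimal 164, and the -k options are placeholders for whatever your real key columns are):

    perl -pe 'BEGIN {$/="\xa4\n"} s/(?<!\xa4)\n/\x01/g' infile \
        | sort -t "$(printf '\244')" -k1,1 -k2,2n \
        | perl -pe 's/\x01/\n/g' > outfile

    The first stage folds the embedded newlines into \x01 so that every record is a single line, the sort then splits fields on the \xa4 separator, and the last stage restores the original newlines.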
Re: perl sort versus Unix sort
by BrowserUk (Patriarch) on Jan 26, 2004 at 04:15 UTC

    Like the Anonymous Monk above, I'm having great difficulty seeing how this code is consuming large quantities of memory.

    The only possibility I can see is that the array of filenames you are passing by reference in $infile_aref is huge.

    If that is the case, then passing by reference is a good thing, but you immediately undo it by duplicating its contents into a local (my'd) array, along with several others:

    my @keyMap=@{$keyMap_aref};
    my @infile=@{$infile_aref};
    my $outfile=$outdir.'/'.$outFileName;
    my @keyPos=@{$keyPos_aref};
    my @outCol=@{$outCol_aref} if defined $outCol_aref;

    The only benefit of which I can see is that it allows you to write

    foreach my $infile(@infile){

    Instead of

    foreach my $infile( @{$infile_aref} ){

    If these arrays are not very big, then that won't be the source of your problem, but it's the only possibility I can see. Otherwise, the memory consumption is occurring outside the auspices of the code you've shown us.

    As far as performance is concerned: all your CPUs are benefiting you very little, as the data transfer is going to a single file through a very small cache (unless you have already increased this?). As such, your process is likely IO-bound, utilising only a single CPU and spending most of its time in IO_wait states.

    Frankly, building a B-tree of all your data just in order to sort it, given that you are then writing the sorted data back to a flat file, makes no sense at all. B-trees are very fast for searching and traversing, but they are expensive to build -- especially if the ordering function requires a callback into Perl code and utilises a multi-level comparison to boot.
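    One well-known way to avoid that callback entirely is the Guttman-Rosler transform: pack each mixed string/numeric key into a single byte string whose byte ordering matches the desired sort order, so a plain cmp (or DB_File's default comparator) does the right thing. A minimal sketch, assuming a two-part key of a string column no longer than 20 characters followed by a non-negative integer:

    use strict;
    use warnings;

    my $sep = "\xA4";

    # Pack a (string, non-negative integer) key pair into one byte
    # string whose byte order equals the desired sort order: the string
    # is space-padded to a fixed width, and the number is packed as a
    # big-endian unsigned 32-bit int, whose bytes compare numerically.
    sub packed_key {
        my ($str, $num) = @_;
        return sprintf('%-20s', $str) . pack('N', $num);
    }

    my @records = ("foo${sep}2${sep}x", "foo${sep}10${sep}y", "bar${sep}5${sep}z");

    my @sorted = map  { $_->[1] }                       # strip the key
                 sort { $a->[0] cmp $b->[0] }           # one cheap byte-wise compare
                 map  { [ packed_key( (split /$sep/, $_)[0, 1] ), $_ ] }
                 @records;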

    You would be substantially better off using a merge sort algorithm, if your individual files are small enough to be sorted entirely in memory.
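    A minimal sketch of that approach (illustrative only -- external_sort, $batch_size and the two-argument $cmp comparator are made-up names, and it assumes $/ has already been set to your record terminator):

    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    sub external_sort {
        my ($in_fh, $out_fh, $batch_size, $cmp) = @_;
        my @runs;

        # Phase 1: read a batch of records, sort it in memory, spill it
        # to a temp file.  Memory use is bounded by $batch_size records.
        while (1) {
            my @batch;
            while (@batch < $batch_size and defined(my $rec = <$in_fh>)) {
                push @batch, $rec;
            }
            last unless @batch;
            my $tmp = tempfile(UNLINK => 1);    # scalar context: handle only
            print {$tmp} $_ for sort { $cmp->($a, $b) } @batch;
            seek $tmp, 0, 0;
            push @runs, $tmp;
        }

        # Phase 2: naive k-way merge -- a linear scan of the run heads
        # is fine for a modest number of runs.
        my @heads = map { scalar readline($_) } @runs;
        while (grep { defined } @heads) {
            my $min;
            for my $i (0 .. $#heads) {
                next unless defined $heads[$i];
                $min = $i if !defined $min
                          or $cmp->($heads[$i], $heads[$min]) < 0;
            }
            print {$out_fh} $heads[$min];
            $heads[$min] = readline($runs[$min]);
        }
        return;
    }

    Only one batch plus one record per run is ever held in memory, so the footprint is bounded no matter how big the input file is.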

    Though, if your system sort can handle the record lengths and volumes of data you are dealing with, and the embedded newlines are the only problem, then converting those embedded newlines to some 'otherwise non-occurring token', using the system sort, and converting the tokens back afterwards makes the most sense of all.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Timing (and a little luck) are everything!

Re: perl sort versus Unix sort
by parv (Parson) on Jan 26, 2004 at 02:36 UTC

    This reply does not solve the OP's problem, but... A bit of white space would have been much appreciated in the first line of btree() (which assigns TONS of variables) and in the $pcode assignment. Something like...

    sub btree {
        # I personally would much prefer hash reference(s) here
        my ( $disk , $workdir , $outFileName , $sep
           , $keyMap_aref , $verbose , $infile_aref , $indir
           , $outdir , $splitSize , $keyPos_aref , $outCol_aref
           ) = @_ ;
        ...
        # No need for "$pcode" & later "eval $pcode";
        # use "$parse->($line)" as you already have.
        my $parse = make_parse_code( $sep , $splitSize , $keyPos_aref );
        ...
    }

    sub make_parse_code {
        my ($sep , $size , $items) = @_;
        return sub {
            my ($line) = @_;
            join $sep , ( (split /$sep/ , $line , $size)[@$items] );
        };
    }

    ...adjust to your liking. Well, lack of white space was bugging me much, so i adjusted the code; i am sure others [cw]ould point out other improvements/suggestions.

Re: perl sort versus Unix sort
by Anonymous Monk on Jan 26, 2004 at 03:37 UTC
    I don't see how the code you presented could possibly be hogging your memory (that problem lies elsewhere).

    Try upgrading your Berkeley DB version and recompiling DB_File, as well as increasing the cache, to alleviate the speed issue (you can't do too much, since file I/O is only so fast).
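    For instance, the btree cache can be enlarged through the $DB_BTREE info hash before the tie (a sketch; the 256MB figure is arbitrary, so tune it to your machine):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Give the btree a much larger cache before tying; the default is small.
    $DB_BTREE->{'cachesize'} = 256 * 1024 * 1024;
    $DB_BTREE->{'compare'}   = sub { $_[0] cmp $_[1] };   # your comparator here

    my %h;
    tie %h, 'DB_File', 'sort.idx', O_RDWR|O_CREAT, 0666, $DB_BTREE
        or die "Cannot open sort.idx: $!\n";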

Re: perl sort versus Unix sort
by bluto (Curate) on Jan 26, 2004 at 17:01 UTC
    I've found that sorting large files (i.e. millions of lines of text, hundreds of MB in size) is, in practice, very hard to do efficiently in Perl. It's possible to find solutions, e.g. via CPAN modules, that work for certain types of data or certain data set sizes. It is also easy to introduce memory/CPU/disk-hogging behavior and then spend lots of time tracking it down.

    If you don't need portability to platforms that don't support using 'sort', I'd suggest following graff's advice about transforming the data and then using Unix's sort command. This kind of implementation is straightforward, quick, and generally robust.