c-era has asked for the wisdom of the Perl Monks concerning the following question:

I wrote a quick program to sort a log file. At the time it worked great, but now the log files are reaching 80 - 100 MB before the sort is run, and I'm starting to get out-of-memory errors when I run my program. At the rate we are growing, the log files will grow by ~50 MB a month, and I can't split them into anything smaller. I know I could write a bubble sort and keep the contents on disk, but I don't want to thrash the disk that much. Does anyone know of a good way to sort large files?

Here is an example of the log file format:

[05:24:50.7] : Start 963365840704 0 Fri Feb 16, 2001 17:24:50.704 CST select FOO from BAR where foo_bar = 578525 order by FOO
[05:24:50.7] : End 963365840704 15
[05:24:50.7] : Start 982365890719 1 Fri Feb 16, 2001 17:24:50.719 CST select BAR from FOO where var_foo = 578525 order by BAR
[05:24:50.7] : End 982365890719 0
I just need the SQL statements, but they need to be sorted. Below is the code I wrote:
print sort { $a cmp $b } map { (split /\t/, $_)[4] } grep { (substr($_, 15, 5) eq "Start") ? 1 : 0 } <>;

Replies are listed 'Best First'.
Re: Sorting a large file
by jeroenes (Priest) on Feb 21, 2001 at 18:41 UTC
    First of all, take a look at Sorting data that don't fit in memory. The BerkeleyDB solution is something that works for sure.
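    For reference, here is a minimal sketch of the kind of thing that node describes, assuming DB_File is available (which, as noted further down, may not be the case under ActivePerl). A hash tied to an on-disk BTREE keeps its keys in sorted order for you, so very little has to stay in RAM. The scratch filename is made up, and the field positions are just copied from the one-liner above:

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Tie a hash to an on-disk BTREE; the database keeps the keys sorted.
    my %seen;
    tie %seen, 'DB_File', 'sort_scratch.db', O_RDWR | O_CREAT, 0644, $DB_BTREE
        or die "Cannot tie sort_scratch.db: $!";

    while (<>) {
        next unless substr($_, 15, 5) eq 'Start';
        my $sql = (split /\t/, $_)[4];
        $seen{$sql}++;                  # count duplicates instead of losing them
    }

    # each() walks a tied BTREE in key order without slurping every key into memory
    while (my ($sql, $count) = each %seen) {
        print $sql for 1 .. $count;
    }

    untie %seen;
    unlink 'sort_scratch.db';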

    Be warned about the memory usage of arrays and hashes in Perl. I found out it takes about 46 bytes to store a single array item in memory. That's a lot. If your arrays run to more than, say, a million items, you might run into trouble.

    You can also use an RDBMS like PostgreSQL or MySQL to manage the log: efficient storage, sorting AND use.

    Maybe the memory problem lies in the fact that perl tries to create a duplicate array holding your data. Do you have twice the memory available before the sort starts?
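    Along those lines, one low-tech tweak (just a sketch, untested against your data) is to skip the grep/map intermediate lists entirely and keep only the SQL field before sorting. It's still an in-memory sort, so it only postpones the problem, but peak usage drops noticeably:

    use strict;
    use warnings;

    my @sql;
    while (<>) {
        next unless substr($_, 15, 5) eq 'Start';   # same test as the one-liner
        push @sql, (split /\t/, $_)[4];             # keep only the SQL field
    }
    print sort @sql;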

    Hope this helps,

    Jeroen
    "We are not alone"(FZ)

      One problem that I should have stated above: I'm on Windows NT, using ActivePerl (no BerkeleyDB). A database seems like overkill for sorting a log file, but it could be an option.
        You might try installing MySQL (it's free) on your NT box, writing a script to monitor your log file for changes and import new entries into the database, and then using SQL to sort the data.

        Celebrate Intellectual Diversity

Re: Sorting a large file
by clintp (Curate) on Feb 21, 2001 at 18:41 UTC
    Very old problem with an easy solution: you sort the file in chunks and then merge the chunks back together. This is called a merge sort. The chunks can be as large (or small) as you like.
      This was hacked up while I ate lunch today. Give it a maximum number of records per file and a very large input source, and awaaaaay you go.

      Use your own sortsub of course. And your own input records (I used random numbers). Otherwise, it's fit to use.

      #!/usr/bin/perl -w
      # Mergesort
      use IO::Handle;   # For the ->getline
      require 5.6.0;    # Sort sub prototypes

      $recs = 13;       # Total number of records to sort.....
                        # Leave out of the real thing
      $max = 5;         # Maximum number of records per merge file
      @files = ();

      # The prototype is needed because we want lexical
      # values in the sort because we're using it as a
      # regular comparison and as a sort sub.
      sub sortsub ($$) {
          my ($c, $d) = @_;
          return $c <=> $d;
      }

      {
          # Should be POSIX::tmpnam. But I'm lazy at the moment.
          # (Under UNIX you can even re-use the same name each
          # time and just unlink it after the push()!)
          $tempname = "fooaa";
          sub store {
              my ($a) = @_;
              my $f;
              open($f, "+>/tmp/$tempname") || die;
              print $f sort sortsub @$a;              # Sort small pile
              seek $f, 0, 0 or warn "Can't seek: $!";
              push(@files, {
                  fh     => $f,
                  queued => scalar <$f>,
              });
              $tempname++;
          }
      }

      # This is where you'd read the input file to exhaustion.
      # I'm just making up data. The important part is the block itself.
      while ($_ = rand() . "\n", $recs--) {
          push(@sortarr, $_);
          if (@sortarr == $max) {
              store(\@sortarr);
              @sortarr = ();
          }
      }
      store(\@sortarr) if @sortarr;                   # Store the leftovers

      LOOP: {
          ($lowest) = (sort { sortsub($a->{queued}, $b->{queued}) }
                       grep(defined $_->{queued}, @files))[0];
          last unless defined $lowest->{queued};
          # Do your processing here
          print $lowest->{queued};
          $lowest->{queued} = $lowest->{fh}->getline();
          redo;
      }
        As I recommended in Re (tilly) 1: Sorting a large file, I would try File::Sort before the above code.

        Secondly, for temporary files, use File::Temp instead of hand-rolling your own.

        And thirdly, the fact that you defined your store function inside your sortsub function suggests some confusion on your part. You cannot get nested functions that way.

        As a matter of personal taste I would drop the prototype, use strict, etc, etc, etc. But the three I gave above are the biggies.
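        To illustrate that third point, here is a tiny made-up example: a named sub declared inside another named sub is still just an ordinary package-level sub, and perl warns that the outer lexical "will not stay shared".

        use strict;
        use warnings;

        sub outer {
            my $x = shift;
            # Not a private inner function: inner() is an ordinary package sub.
            # perl warns: Variable "$x" will not stay shared
            sub inner { return $x }
            return inner();
        }

        print outer("first"), "\n";    # prints "first"
        print outer("second"), "\n";   # still prints "first" -- inner() keeps the
                                       # $x from the first call; nothing is nested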

Re: Sorting a large file
by Beatnik (Parson) on Feb 21, 2001 at 19:24 UTC
Re: Sorting a large file (non-perl solution)
by Malkavian (Friar) on Feb 21, 2001 at 18:54 UTC
    In the vein of considering tools for jobs...
    Is Perl actually the right tool for this job? I know that:
    sort <file> | uniq
    gives a pretty good sort speed.
    The logs I deal with here are hundreds of megs long (or several gigs in many cases), and I have to deal with an awful lot of them.
    It works fine for me. I know it's not Perl, but it's a tool that does the job adequately for a lot of things.

    Just a thought,

    Malk.

    Update:
    For a set of Windows tools (only just noticed you were using that), try this.
      Once again, it's the Windows thing, but if you know of a good sort for Windows, that could be another option. The development tools we have are Java and Perl. The Java guys couldn't get a decent sort to work under Java, so they came to me to do it in Perl.

      Update: Thanks for the link. I can get away with installing a couple of small programs, and it only uses a couple MB of memory.

        You can still use Malkavian's solution on Windows: check out and install Cygwin. OTOH, it'll give you a lot more than just a sort program.
Re (tilly) 1: Sorting a large file
by tilly (Archbishop) on Feb 21, 2001 at 21:06 UTC
    You may wish to try File::Sort to do a merge sort in limited memory without requiring an external database to be installed.

    I have not used it, just decided to look and see whether CPAN had anything to keep you from having to roll your own merge-sort from scratch.
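    Not having used it either, the synopsis suggests usage along these lines (untested; check the module's documentation for the exact options). The filenames are made up, and it assumes the SQL lines have already been extracted to a file:

    use File::Sort qw(sort_file);

    # Merge-sorts extracted_sql.log into extracted_sql.sorted using temporary
    # files on disk rather than holding everything in memory.
    sort_file('extracted_sql.log', 'extracted_sql.sorted');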

Re: Sorting a large file
by Kickstart (Pilgrim) on Feb 21, 2001 at 21:30 UTC
    This is the line I used to sort ~30 MB log files, based on the position of the field I wanted to sort by (the -k4 in the sort command). The standard Unix sort command seems a whole lot faster than anything written in Perl, which is why my script calls out to it.

    system("sort -T /home/tmp -k4 fixed.log > single.$wantdate.log");

    Kickstart

Re: Sorting a large file
by zakzebrowski (Curate) on Feb 21, 2001 at 20:05 UTC
    An idea: use Perl to import the log file into MySQL (or your favorite DBMS) with a script along the lines of
    open file
    while (!eof) {
        chomp
        insert into log_file values (time, value, sql)
    }
    Then report using:
    select * from log_file order by sql
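    A rough DBI sketch of that idea, assuming DBD::mysql is installed; the database name, credentials, table layout, and column mapping are all made up for illustration:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=logs', 'user', 'password',
                           { RaiseError => 1 });

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS log_file (
            log_time VARCHAR(32),
            elapsed  VARCHAR(32),
            sql_text TEXT
        )
    });

    my $ins = $dbh->prepare(
        'INSERT INTO log_file (log_time, elapsed, sql_text) VALUES (?, ?, ?)');

    while (<>) {
        chomp;
        next unless substr($_, 15, 5) eq 'Start';
        my @field = split /\t/;              # same tab layout as the one-liner
        $ins->execute(@field[0, 2, 4]);      # which column is which is a guess
    }

    # Let the database do the sorting.
    my $sel = $dbh->prepare('SELECT sql_text FROM log_file ORDER BY sql_text');
    $sel->execute;
    while (my ($sql) = $sel->fetchrow_array) {
        print "$sql\n";
    }

    $dbh->disconnect;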
    Zak