c-era has asked for the wisdom of the Perl Monks concerning the following question:

I wrote a quick program to sort a log file. At the time it worked great, but now the log files are reaching 80 - 100 MB before the sort is run, and I'm starting to get out-of-memory errors when I run my program. At the rate we are growing, the log files will grow by ~50 MB a month, and I can't split them into anything smaller. I know I could write a bubble sort and keep the contents on disk, but I don't want to thrash the disk that much. Does anyone know of a good way to sort large files?

Here is an example of the log file format:

[05:24:50.7] : Start 963365840704 0 Fri Feb 16, 2001 17:24:50.704 CST select FOO from BAR where foo_bar = 578525 order by FOO
[05:24:50.7] : End 963365840704 15
[05:24:50.7] : Start 982365890719 1 Fri Feb 16, 2001 17:24:50.719 CST select BAR from FOO where var_foo = 578525 order by BAR
[05:24:50.7] : End 982365890719 0
I just need the SQL statements, but they need to be sorted. Below is the code I wrote:
print sort { $a cmp $b } map { (split /\t/, $_)[4] } grep { (substr($_, 15, 5) eq "Start") ? 1 : 0 } <>;

Replies are listed 'Best First'.
Re: Sorting a large file
by jeroenes (Priest) on Feb 21, 2001 at 18:41 UTC
    First of all, take a look at Sorting data that don't fit in memory. The BerkeleyDB solution is something that works for sure.
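    For reference, here is a minimal sketch of the kind of thing that node describes, assuming DB_File is available (which, as noted further down, may not be the case under ActivePerl). A hash tied to an on-disk BTREE keeps its keys in sorted order for you, so very little has to stay in RAM. The scratch filename is made up, and the field positions are just copied from the one-liner above:

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Tie a hash to an on-disk BTREE; the database keeps the keys sorted.
    my %seen;
    tie %seen, 'DB_File', 'sort_scratch.db', O_RDWR | O_CREAT, 0644, $DB_BTREE
        or die "Cannot tie sort_scratch.db: $!";

    while (<>) {
        next unless substr($_, 15, 5) eq 'Start';
        my $sql = (split /\t/, $_)[4];
        $seen{$sql}++;                  # count duplicates instead of losing them
    }

    # each() walks a tied BTREE in key order without slurping every key into memory
    while (my ($sql, $count) = each %seen) {
        print $sql for 1 .. $count;
    }

    untie %seen;
    unlink 'sort_scratch.db';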

    Be warned about the memory usage of arrays and hashes in Perl. I found out it takes about 46 bytes to store a single array item in memory. That's a lot. If your arrays run to more than, say, a million items, you might run into trouble.

    You can also use an RDBMS like PostgreSQL or MySQL to manage the log: efficient storage, sorting AND use.

    Maybe the memory problem lies in the fact that perl tries to create a duplicate array holding your data. Do you have twice the memory available before the sort starts?
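    Along those lines, one low-tech tweak (just a sketch, untested against your data) is to skip the grep/map intermediate lists entirely and keep only the SQL field before sorting. It's still an in-memory sort, so it only postpones the problem, but peak usage drops noticeably:

    use strict;
    use warnings;

    my @sql;
    while (<>) {
        next unless substr($_, 15, 5) eq 'Start';   # same test as the one-liner
        push @sql, (split /\t/, $_)[4];             # keep only the SQL field
    }
    print sort @sql;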

    Hope this helps,

    Jeroen
    "We are not alone"(FZ)

      One problem that I should have stated above: I'm on Windows NT, using ActivePerl (no BerkeleyDB). A database seems like overkill for sorting a log file, but it could be an option.
        You might try installing MySQL (it's free) on your NT box, writing a script to monitor your log file for changes and import new entries into the database, and then using SQL to sort the data.

        Celebrate Intellectual Diversity

Re: Sorting a large file
by clintp (Curate) on Feb 21, 2001 at 18:41 UTC
    Very old problem with an easy solution: you sort the file in chunks and then merge the chunks back together. This is called a merge sort. The chunks can be as large (or small) as you like.
      This was hacked up while I ate lunch today. Give it a maximum number of records per file and a very large input source, and awaaaaay you go.

      Use your own sortsub of course. And your own input records (I used random numbers). Otherwise, it's fit to use.

      #!/usr/bin/perl -w
      # Mergesort
      use IO::Handle;   # For the ->getline
      require 5.6.0;    # Sort sub prototypes

      $recs = 13;       # Total number of records to sort.....
                        # Leave out of the real thing
      $max = 5;         # Maximum number of records per merge file
      @files = ();

      # The prototype is needed because we want lexical
      # values in the sort because we're using it as a
      # regular comparison and as a sort sub.
      sub sortsub ($$) {
          my ($c, $d) = @_;
          return $c <=> $d;
      }

      {
          # Should be POSIX::tmpnam. But I'm lazy at the moment.
          # (Under UNIX you can even re-use the same name each
          # time and just unlink it after the push()!)
          $tempname = "fooaa";
          sub store {
              my ($a) = @_;
              my $f;
              open($f, "+>/tmp/$tempname") || die;
              print $f sort sortsub @$a;              # Sort small pile
              seek $f, 0, 0 or warn "Can't seek: $!";
              push(@files, {
                  fh     => $f,
                  queued => scalar <$f>,
              });
              $tempname++;
          }
      }

      # This is where you'd read the input file to exhaustion.
      # I'm just making up data. The important part is the block itself.
      while ($_ = rand() . "\n", $recs--) {
          push(@sortarr, $_);
          if (@sortarr == $max) {
              store(\@sortarr);
              @sortarr = ();
          }
      }
      store(\@sortarr) if @sortarr;                   # Store the leftovers

      LOOP: {
          ($lowest) = (sort { sortsub($a->{queued}, $b->{queued}) }
                       grep(defined $_->{queued}, @files))[0];
          last unless defined $lowest->{queued};
          # Do your processing here
          print $lowest->{queued};
          $lowest->{queued} = $lowest->{fh}->getline();
          redo;
      }
        As I recommended in Re (tilly) 1: Sorting a large file, I would try File::Sort before the above code.

        Secondly, for temporary files, use File::Temp instead of hand-rolling your own.

        And thirdly, the fact that you defined your store function inside your sortsub function suggests some confusion on your part. You cannot get nested functions that way.

        As a matter of personal taste I would drop the prototype, use strict, etc, etc, etc. But the three I gave above are the biggies.
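        To illustrate that third point, here is a tiny made-up example: a named sub declared inside another named sub is still just an ordinary package-level sub, and perl warns that the outer lexical "will not stay shared".

        use strict;
        use warnings;

        sub outer {
            my $x = shift;
            # Not a private inner function: inner() is an ordinary package sub.
            # perl warns: Variable "$x" will not stay shared
            sub inner { return $x }
            return inner();
        }

        print outer("first"), "\n";    # prints "first"
        print outer("second"), "\n";   # still prints "first" -- inner() keeps the
                                       # $x from the first call; nothing is nested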

Re: Sorting a large file
by Beatnik (Parson) on Feb 21, 2001 at 19:24 UTC
Re: Sorting a large file (non-perl solution)
by Malkavian (Friar) on Feb 21, 2001 at 18:54 UTC
    In the vein of considering tools for jobs...
    Is Perl actually the right tool for this job? I know that:
    sort <file> | uniq
    gives a pretty good sort speed.
    The logs I deal with here are hundreds of megs long (or several gigs in many cases), and I have to deal with an awful lot of them.
    It works fine for me. I know it's not Perl, but it's a tool that does the job adequately for a lot of things.

    Just a thought,

    Malk.

    Update:
    For a set of Windows tools (only just noticed you were using that), try this.
      Once again, it's the Windows thing, but if you know of a good sort for Windows, that could be another option. The development tools we have are Java and Perl. The Java guys couldn't get a decent sort to work under Java, so they came to me to do it in Perl.

      Update: Thanks for the link. I can get away with installing a couple of small programs, and it only uses a couple MB of memory.

        You can still use Malkavian's solution on Windows: check out and install Cygwin. OTOH, it'll give you a lot more than just a sort program.
Re (tilly) 1: Sorting a large file
by tilly (Archbishop) on Feb 21, 2001 at 21:06 UTC
    You may wish to try File::Sort to do a merge sort in limited memory without requiring an external database to be installed.

    I have not used it, just decided to look and see whether CPAN had anything to keep you from having to roll your own merge-sort from scratch.
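    Not having used it either, the synopsis suggests usage along these lines (untested; check the module's documentation for the exact options). The filenames are made up, and it assumes the SQL lines have already been extracted to a file:

    use File::Sort qw(sort_file);

    # Merge-sorts extracted_sql.log into extracted_sql.sorted using temporary
    # files on disk rather than holding everything in memory.
    sort_file('extracted_sql.log', 'extracted_sql.sorted');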

Re: Sorting a large file
by Kickstart (Pilgrim) on Feb 21, 2001 at 21:30 UTC
    This is the line I used to sort ~30 MB log files, based on the position of the field I wanted to sort by (the -k4 in the sort command). The standard Unix sort command seems a whole lot faster than anything written in Perl, which is why my script calls out to it.

    system("sort -T /home/tmp -k4 fixed.log > single.$wantdate.log");

    Kickstart

Re: Sorting a large file
by zakzebrowski (Curate) on Feb 21, 2001 at 20:05 UTC
    An idea: use Perl to import the log file into MySQL (or your favorite DBMS) with a script along the lines of
    open file
    while (!eof) {
        chomp
        insert into log_file values (time, value, sql)
    }
    Then report using:
    select * from log_file order by sql
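    A rough DBI sketch of that idea, assuming DBD::mysql is installed; the database name, credentials, table layout, and column mapping are all made up for illustration:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=logs', 'user', 'password',
                           { RaiseError => 1 });

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS log_file (
            log_time VARCHAR(32),
            elapsed  VARCHAR(32),
            sql_text TEXT
        )
    });

    my $ins = $dbh->prepare(
        'INSERT INTO log_file (log_time, elapsed, sql_text) VALUES (?, ?, ?)');

    while (<>) {
        chomp;
        next unless substr($_, 15, 5) eq 'Start';
        my @field = split /\t/;              # same tab layout as the one-liner
        $ins->execute(@field[0, 2, 4]);      # which column is which is a guess
    }

    # Let the database do the sorting.
    my $sel = $dbh->prepare('SELECT sql_text FROM log_file ORDER BY sql_text');
    $sel->execute;
    while (my ($sql) = $sel->fetchrow_array) {
        print "$sql\n";
    }

    $dbh->disconnect;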
    Zak