Re: Sorting a large file
by jeroenes (Priest) on Feb 21, 2001 at 18:41 UTC
First of all, take a look at Sorting data that don't fit in memory. The BerkeleyDB solution is something that works for sure.
Be warned about the memory usage of arrays and hashes in Perl. I found that it takes about 46 bytes of memory to store a single array item. That's a lot; if your arrays run to more than, say, a million items, you might run into trouble.
You can also use an RDBMS like PostgreSQL or MySQL to manage the log: efficient storage, sorting AND use.
Maybe the memory problem lies in the fact that Perl creates a duplicate of the array while sorting it. Do you have twice the memory available before the sort starts?
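To make the tied-database idea concrete, here is a minimal sketch (file names are made up, and this is not the exact code from that node) using DB_File's BTREE format, which keeps its keys sorted on disk so very little has to sit in memory:
#!/usr/bin/perl -w
use strict;
use DB_File;
use Fcntl;

# Tie a hash to an on-disk BTREE: keys are kept in sorted (string) order,
# so the whole log never has to fit in memory at once.
tie my %lines, 'DB_File', 'sortscratch.db', O_RDWR|O_CREAT, 0640, $DB_BTREE
    or die "Cannot tie sortscratch.db: $!";

open my $in, '<', 'big.log' or die "Cannot open big.log: $!";
$lines{$_}++ while <$in>;        # count duplicates instead of dropping them
close $in;

open my $out, '>', 'big.sorted' or die "Cannot write big.sorted: $!";
while (my ($line, $count) = each %lines) {   # BTREE traversal is in key order
    print {$out} $line for 1 .. $count;
}
close $out;

untie %lines;
unlink 'sortscratch.db';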
Hope this helps, Jeroen
"We are not alone"(FZ)
One problem that I should have stated above: I'm on Windows NT, using ActivePerl (no BerkeleyDB). A database seems like overkill for sorting a log file, but it could be an option.
Re: Sorting a large file
by clintp (Curate) on Feb 21, 2001 at 18:41 UTC
Very old problem with an easy solution: you sort the file in chunks and then merge the chunks back together. This is called a merge sort.
The chunks can be as large (or small) as you like.
#!/usr/bin/perl -w
# Mergesort
use IO::Handle;     # For the ->getline
require 5.6.0;      # Sort sub prototypes

$recs = 13;         # Total number of records to sort.
                    # Leave out of the real thing.
$max  = 5;          # Maximum number of records per merge file
@files = ();

# The prototype is needed because we want lexical
# values in the sort, because we're using it both as a
# regular comparison and as a sort sub.
sub sortsub ($$) { my ($c, $d) = @_; return $c <=> $d; }

{
    # Should be POSIX::tmpnam. But I'm lazy at the moment.
    # (Under UNIX you can even re-use the same name each
    # time and just unlink it after the push()!)
    $tempname = "fooaa";
    sub store {
        my ($a) = @_;
        my $f;
        open($f, "+>/tmp/$tempname") || die;
        print $f sort sortsub @$a;          # Sort small pile
        seek $f, 0, 0 or warn "Can't seek: $!";
        push(@files, {
            fh     => $f,
            queued => scalar <$f>,
        });
        $tempname++;
    }
}

# This is where you'd read the input file to exhaustion.
# I'm just making up data. The important part is the block itself.
while ($_ = rand() . "\n", $recs--) {
    push(@sortarr, $_);
    if (@sortarr == $max) {
        store(\@sortarr);
        @sortarr = ();
    }
}
store(\@sortarr) if @sortarr;               # Store the leftovers

LOOP: {
    ($lowest) = (sort {
        sortsub($a->{queued}, $b->{queued});
    } grep(defined $_->{queued}, @files))[0];
    last unless defined $lowest->{queued};
    # Do your processing here
    print $lowest->{queued};
    $lowest->{queued} = $lowest->{fh}->getline();
    redo;
}
As I recommended in Re (tilly) 1: Sorting a large file, I would try File::Sort before the above code.
Secondly, for temporary files, use File::Temp instead of hand-rolling your own.
And thirdly, the fact that you defined your store function inside of your sortsub function suggests some confusion on your part. You cannot get nested functions that way.
As a matter of personal taste I would also drop the prototype, use strict, etc., etc. But the three I gave above are the biggies.
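As an illustration of the File::Temp point, here is a minimal, self-contained sketch of how the chunk-writing could look (the chunk contents and the numeric comparison are just placeholders, not clintp's exact code):
#!/usr/bin/perl -w
use strict;
use File::Temp qw(tempfile);

# Write one sorted chunk to a temp file and return the handle, rewound
# and ready for the merge phase. tempfile() picks a unique name for us;
# UNLINK => 1 deletes the file automatically when the program exits.
sub store_chunk {
    my @chunk = @_;
    my ($fh, $name) = tempfile('mergeXXXX', UNLINK => 1);
    print {$fh} sort { $a <=> $b } @chunk;
    seek $fh, 0, 0 or die "Can't seek $name: $!";
    return $fh;
}

my $fh = store_chunk(map { rand() . "\n" } 1 .. 5);
print while <$fh>;   # read the sorted chunk back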
Re: Sorting a large file
by Beatnik (Parson) on Feb 21, 2001 at 19:24 UTC
Re: Sorting a large file (non-perl solution)
by Malkavian (Friar) on Feb 21, 2001 at 18:54 UTC
In the vein of considering tools for jobs...
Is Perl actually the right tool for doing this? I know that:
sort <file> | uniq
gives a pretty good sort speed. The logs I deal with here are hundreds of megs long (or several gigs in many cases), and I have to deal with an awful lot of them.
It works fine for me. I know it's not Perl, but it's a tool that does the job adequately for a lot of things.
Just a thought,
Malk.
Update:
For a set of Windows tools (only just noticed you were using that), try this.
You can still use Malkavian's solution on Windows. Check out Cygwin, or just install it.
OTOH it'll give you a lot more than just a sort program.
Re (tilly) 1: Sorting a large file
by tilly (Archbishop) on Feb 21, 2001 at 21:06 UTC
You may wish to try File::Sort to do a merge-sort using limited memory, without requiring any external database to be installed.
I have not used it; I just decided to look and see whether CPAN had anything to keep you from having to roll your own merge-sort from scratch.
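A rough sketch of what using it might look like, assuming File::Sort's sort_file() interface (its options mirror sort(1) switches; the file names and key field here are made up, so check the module's docs for the exact option names):
use File::Sort qw(sort_file);

# Merge-sort fixed.log on disk and write the result to sorted.log,
# keying on the fourth field (like sort -k4).
sort_file({
    I => 'fixed.log',
    o => 'sorted.log',
    k => 4,
});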
Re: Sorting a large file
by Kickstart (Pilgrim) on Feb 21, 2001 at 21:30 UTC
This is the line I used to sort ~30MB log files, keyed on the position of the field I wanted to sort by (the -k4 in the sort command). The standard sort command on Unix seems a whole lot faster than anything written in Perl, which is why my script calls out to it.
system("sort -T /home/tmp -k4 fixed.log > single.$wantdate.log");
Kickstart
Re: Sorting a large file
by zakzebrowski (Curate) on Feb 21, 2001 at 20:05 UTC
An idea: use Perl to import the log file into MySQL (or your favorite DBMS) with a script along the lines of
Open file
while !eof
{
<chomp>
insert into log_file values (time,value,sql)
}
Then report using:
select * from log_file order by sql
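A rough sketch of that idea with DBI (connection details, table layout, and column names are all made up here; adjust to your own schema):
#!/usr/bin/perl -w
use strict;
use DBI;

# Load the log into a table, then let the database do the sorting.
my $dbh = DBI->connect('dbi:mysql:database=logs', 'user', 'password',
                       { RaiseError => 1 });

my $sth = $dbh->prepare(
    'INSERT INTO log_file (time, value, request) VALUES (?, ?, ?)');

open my $fh, '<', 'fixed.log' or die "Cannot open fixed.log: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($time, $value, $request) = split /\s+/, $line, 3;
    $sth->execute($time, $value, $request);
}
close $fh;

# Report in sorted order; the database handles the sorting.
my $q = $dbh->prepare('SELECT time, value, request FROM log_file ORDER BY time');
$q->execute;
while (my @row = $q->fetchrow_array) {
    print join("\t", @row), "\n";
}

$dbh->disconnect;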
Zak