Re: Sorting a (very) large file
by moritz (Cardinal) on Nov 30, 2007 at 19:31 UTC
There are three things you could try:
1) The one that will always work is to split the file into smaller chunks, sort them separately, and then merge them again (a rough sketch follows below).
2) If you work on a unixish system, you could try to abuse your system's sort program; at least GNU sort handles memory limitations automatically by creating temporary files.
3) Search on CPAN. Maybe there's a module that does what you want? File::Sort looks promising at first glance (actually CPAN should be your first place to look ;-)
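A rough Perl sketch of option 1, assuming a tab-delimited listing sorted numerically on its third (size) column; the file names and chunk size here are made up for illustration:
#!/usr/bin/perl
use strict;
use warnings;

my $chunk_lines = 500_000;   # tune to whatever fits in memory
my @chunk_files;
my @lines;

open my $in, '<', 'input.txt' or die "input.txt: $!";
while (defined(my $line = readline $in)) {
    push @lines, $line;
    if (@lines >= $chunk_lines) {
        write_chunk(\@lines, \@chunk_files);
        @lines = ();
    }
}
write_chunk(\@lines, \@chunk_files) if @lines;
close $in;

# Merge: keep one pending line per chunk and always emit the smallest.
my @fh   = map { open my $h, '<', $_ or die "$_: $!"; $h } @chunk_files;
my @head = map { scalar readline $_ } @fh;
open my $sorted, '>', 'sorted.txt' or die "sorted.txt: $!";
while (grep { defined } @head) {
    my ($best, $best_key);
    for my $i (0 .. $#head) {
        next unless defined $head[$i];
        my $key = (split /\t/, $head[$i])[2];
        ($best, $best_key) = ($i, $key)
            if !defined $best or $key < $best_key;
    }
    print {$sorted} $head[$best];
    $head[$best] = readline $fh[$best];   # refill from that chunk
}
close $sorted;

# Sort one chunk in memory and spill it to a temporary file.
sub write_chunk {
    my ($lines, $chunks) = @_;
    my @out = sort { (split /\t/, $a)[2] <=> (split /\t/, $b)[2] } @$lines;
    my $tmp = "chunk." . scalar @$chunks;
    open my $out, '>', $tmp or die "$tmp: $!";
    print {$out} @out;
    close $out;
    push @$chunks, $tmp;
}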
3) Search on CPAN. Maybe there's a module that does what you want?
Sort::External
Y'know, Sort::External is designed for exactly what the original poster wants to do... but few people seem to find it on CPAN.
Sort::External was my first CPAN distro, and the people on the modules mailing list prevailed upon me to release it under the academically correct name. I occasionally regret having taken that advice.
FWIW, here's the definition of "external sort" from nist.gov:
Any sort algorithm that uses external memory, such as tape or disk, during the sort. Since most common sort algorithms assume high-speed random access to all intermediate memory, they are unsuitable if the values to be sorted don't fit in main memory.
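If it helps, here's roughly how I remember the Sort::External interface being used (feed/finish/fetch); the input file name and the fixed-width key packing are just for illustration, so check the module's docs:
use strict;
use warnings;
use Sort::External;

# Prepend a zero-padded copy of the size field so that plain string
# comparison (Sort::External's default) orders lines by size.
my $sortex = Sort::External->new( mem_threshold => 64 * 1024 * 1024 );

open my $in, '<', 'input.txt' or die "input.txt: $!";
while (my $line = <$in>) {
    my $size = (split /\t/, $line)[2];
    $sortex->feed( sprintf("%015d", $size) . $line );
}
close $in;

$sortex->finish;
while ( defined( my $item = $sortex->fetch ) ) {
    print substr($item, 15);   # strip the packed key back off
}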
Re: Sorting a (very) large file
by samtregar (Abbot) on Nov 30, 2007 at 19:47 UTC
You probably do have enough memory to sort the file, but you need to be careful about how you handle it. Doing a Schwartzian transform on it like that is creating huge temporary copies, which is probably what's blowing through your memory. You might try something slower that uses less memory:
@lines = <INPUTFILE>;
@lines = sort { (split /\t/, $a)[2] <=> (split /\t/, $b)[2] } @lines;
I believe the most recent Perls have a special in-place sort() optimization when you're sorting an array and assigning it back to itself. (UPDATE: Yup, it was added in v5.8.4)
-sam
Or you could try something twice as fast that uses much less memory. In this case I'd likely sort parallel arrays (though it isn't too complicated to do something even faster that uses even less memory, such as fast, flexible, stable sort).
my @size= map { ( split /\t/, $_ )[2] } @in;
my @idx= sort { $size[$a] <=> $size[$b] } 0..$#in;
@in= @in[@idx];
And I'd check how much paging space is configured. It sounds like that could be increased.
Re: Sorting a (very) large file
by metaperl (Curate) on Nov 30, 2007 at 19:51 UTC
I would toss the data in an SQLite database and let it do the dirty work.
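For instance, a rough DBI/DBD::SQLite sketch along those lines (the database, table, and column names are invented for illustration, and the third tab-separated field is assumed to be the size):
use strict;
use warnings;
use DBI;

# Load the listing into a throwaway SQLite database, then let SQLite
# do the sorting (it spills to disk on its own if it has to).
my $dbh = DBI->connect("dbi:SQLite:dbname=listing.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do("CREATE TABLE listing (line TEXT, size INTEGER)");

my $ins = $dbh->prepare("INSERT INTO listing (line, size) VALUES (?, ?)");
open my $in, '<', 'input.txt' or die "input.txt: $!";
while (my $line = <$in>) {
    my $size = (split /\t/, $line)[2];
    $ins->execute($line, $size);
}
close $in;
$dbh->commit;

my $sth = $dbh->prepare("SELECT line FROM listing ORDER BY size");
$sth->execute;
while (my ($line) = $sth->fetchrow_array) {
    print $line;
}
$dbh->disconnect;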
I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes.
How can you say that there is no God in the Church of Emacs? -- David Kastrup
Re: Sorting a (very) large file
by Limbic~Region (Chancellor) on Nov 30, 2007 at 19:50 UTC
condar,
There have been some ideas already from the chatterbox. Here is an idea that I had while I was thinking about it:
#!/usr/bin/perl
use strict;
use warnings;

my %data;
my $file = $ARGV[0] or die "Usage: $0 <input_file>";
open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $!";

my $pos = tell $fh;
while (<$fh>) {
    my $size = (split /\t/)[2];
    $data{$size} = exists $data{$size}
        ? $data{$size} . ":$pos"
        : $pos;
    $pos = tell $fh;
}

for my $size (sort {$b <=> $a} keys %data) {
    for my $pos (split /:/, $data{$size}) {
        seek($fh, $pos, 0);
        my $line = <$fh>;
        print $line;
    }
}
Note: This code is untested and is not optimal. It was just an idea I had that you might want to try. It will output lines with the same size in the same order they appeared in the original file.
Update: Inlined get_size() sub, which was just a call to split. Also specified keys %data.
Re: Sorting a (very) large file
by sundialsvc4 (Abbot) on Nov 30, 2007 at 21:22 UTC
400MB is not a “very large” file. It is “not a very large file at all.”
What you need to do is use a disk-based sort. There are many of these on CPAN, such as File::Sort. Or, how about Sort::External? As it happens, “sorting” is one of the most extensively-studied and extensively-implemented algorithms in all of computer science. (The other is “searching,” saith the august Dr. Knuth.) My CPAN-search found 1,929 choices under “Sort.”
Although virtual-memory sorting is simple and easy to conceptualize, it is not intended for large volumes of data. You're seeing it choke, and fail, trying to do a task that is much larger than it was designed for.
Disk-oriented tools, of which there are a great many available, will automatically handle the job by splitting the total job into a group of disk-based spill files, then merging those files together for you. An arbitrary amount of data can be sorted in this way and it will be quite fast.
(Heck, you can even sort data using nothing more than magnetic tape! When all those campy old sci-fi movies needed to film “a computer,” the operators would obligingly start a tape-sort.)
Re: Sorting a (very) large file
by jrsimmon (Hermit) on Nov 30, 2007 at 19:42 UTC
The example you posted appears to be the output of a "dir" command from a dos-like system. Have you considered sorting the data as you capture it ("dir /OS" on winXP)?
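From Perl that might look something like this (the output file name is hypothetical):
# Windows only: dir /OS emits the listing already sorted by size.
system('dir /OS > listing.txt') == 0
    or die "dir failed: $?";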
Re: Sorting a (very) large file
by runrig (Abbot) on Nov 30, 2007 at 20:32 UTC
moritz already mentioned command line sort... here's how to do it (if you are on Windows, as jrsimmon believes, go get MSYS):
> # The file is tab delimited (your time field having spaces shouldn't matter)
> cat tmp.txt
3 1a 2 b
2 2c 3 b
1 3b 1 b
> # The char between the quotes is a tab ("\t" works in ksh)
> sort -t" " -k3n tmp.txt
1 3b 1 b
3 1a 2 b
2 2c 3 b
>
Update: I do not claim that the above quotes are properly escaped for Windows. Actually I'm not even sure how to escape them in the cmd shell, but this works from perl: system('sort', "-t", "\t", "-k3n", "tmp.txt");