Re: Sorting a (very) large file
by moritz (Cardinal) on Nov 30, 2007 at 19:31 UTC
There are three things you could try:
1) The one that will always work is to split the file into smaller chunks, sort them separately, and then merge them again (a rough sketch follows below).
2) If you work on a unixish system, you could try to abuse your system's sort program; at least GNU sort handles memory limitations automatically by creating temporary files.
3) Search on CPAN. Maybe there's a module that does what you want? File::Sort looks promising at first glance (actually CPAN should be your first place to look ;-)
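A rough Perl sketch of option 1, assuming a tab-delimited listing sorted numerically on its third (size) column; the file names and chunk size here are made up for illustration:
#!/usr/bin/perl
use strict;
use warnings;

my $chunk_lines = 500_000;   # tune to whatever fits in memory
my @chunk_files;
my @lines;

open my $in, '<', 'input.txt' or die "input.txt: $!";
while (defined(my $line = readline $in)) {
    push @lines, $line;
    if (@lines >= $chunk_lines) {
        write_chunk(\@lines, \@chunk_files);
        @lines = ();
    }
}
write_chunk(\@lines, \@chunk_files) if @lines;
close $in;

# Merge: keep one pending line per chunk and always emit the smallest.
my @fh   = map { open my $h, '<', $_ or die "$_: $!"; $h } @chunk_files;
my @head = map { scalar readline $_ } @fh;
open my $sorted, '>', 'sorted.txt' or die "sorted.txt: $!";
while (grep { defined } @head) {
    my ($best, $best_key);
    for my $i (0 .. $#head) {
        next unless defined $head[$i];
        my $key = (split /\t/, $head[$i])[2];
        ($best, $best_key) = ($i, $key)
            if !defined $best or $key < $best_key;
    }
    print {$sorted} $head[$best];
    $head[$best] = readline $fh[$best];   # refill from that chunk
}
close $sorted;

# Sort one chunk in memory and spill it to a temporary file.
sub write_chunk {
    my ($lines, $chunks) = @_;
    my @out = sort { (split /\t/, $a)[2] <=> (split /\t/, $b)[2] } @$lines;
    my $tmp = "chunk." . scalar @$chunks;
    open my $out, '>', $tmp or die "$tmp: $!";
    print {$out} @out;
    close $out;
    push @$chunks, $tmp;
}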
3) Search on CPAN. Maybe there's a module that does what you want?
Sort::External
Y'know, Sort::External is designed for exactly what the original poster wants to do... but few people seem to find it on CPAN.
Sort::External was my first CPAN distro, and the people on the modules mailing list prevailed upon me to release it under the academically correct name. I occasionally regret having taken that advice.
FWIW, here's the definition of "external sort" from nist.gov:
Any sort algorithm that uses external memory, such as tape or disk, during the sort. Since most common sort algorithms assume high-speed random access to all intermediate memory, they are unsuitable if the values to be sorted don't fit in main memory.
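If it helps, here's roughly how I remember the Sort::External interface being used (feed/finish/fetch); the input file name and the fixed-width key packing are just for illustration, so check the module's docs:
use strict;
use warnings;
use Sort::External;

# Prepend a zero-padded copy of the size field so that plain string
# comparison (Sort::External's default) orders lines by size.
my $sortex = Sort::External->new( mem_threshold => 64 * 1024 * 1024 );

open my $in, '<', 'input.txt' or die "input.txt: $!";
while (my $line = <$in>) {
    my $size = (split /\t/, $line)[2];
    $sortex->feed( sprintf("%015d", $size) . $line );
}
close $in;

$sortex->finish;
while ( defined( my $item = $sortex->fetch ) ) {
    print substr($item, 15);   # strip the packed key back off
}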
Re: Sorting a (very) large file
by samtregar (Abbot) on Nov 30, 2007 at 19:47 UTC
You probably do have enough memory to sort the file, but you need to be careful about how you handle it. Doing a Schwartzian transform on it like that is creating huge temporary copies, which is probably what's blowing through your memory. You might try something slower that uses less memory:
@lines = <INPUTFILE>;
@lines = sort { (split /\t/, $a)[2] <=> (split /\t/, $b)[2] } @lines;
I believe the most recent Perls have a special in-place sort() optimization when you're sorting an array and assigning it back to itself. (UPDATE: Yup, it was added in v5.8.4)
-sam
Or you could try something twice as fast that uses much less memory. In this case I'd likely sort parallel arrays (though it isn't too complicated to do something even faster that uses even less memory, such as fast, flexible, stable sort).
my @size= map { ( split /\t/, $_ )[2] } @in;
my @idx= sort { $size[$a] <=> $size[$b] } 0..$#in;
@in= @in[@idx];
And I'd check how much paging space is configured. It sounds like that could be increased.
Re: Sorting a (very) large file
by metaperl (Curate) on Nov 30, 2007 at 19:51 UTC
I would toss the data in an SQLite database and let it do the dirty work.
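For instance, a rough DBI/DBD::SQLite sketch along those lines (the database, table, and column names are invented for illustration, and the third tab-separated field is assumed to be the size):
use strict;
use warnings;
use DBI;

# Load the listing into a throwaway SQLite database, then let SQLite
# do the sorting (it spills to disk on its own if it has to).
my $dbh = DBI->connect("dbi:SQLite:dbname=listing.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do("CREATE TABLE listing (line TEXT, size INTEGER)");

my $ins = $dbh->prepare("INSERT INTO listing (line, size) VALUES (?, ?)");
open my $in, '<', 'input.txt' or die "input.txt: $!";
while (my $line = <$in>) {
    my $size = (split /\t/, $line)[2];
    $ins->execute($line, $size);
}
close $in;
$dbh->commit;

my $sth = $dbh->prepare("SELECT line FROM listing ORDER BY size");
$sth->execute;
while (my ($line) = $sth->fetchrow_array) {
    print $line;
}
$dbh->disconnect;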
I have beheld the tarball of 22.1 on ftp.gnu.org with my own eyes.
How can you say that there is no God in the Church of Emacs? -- David Kastrup
Re: Sorting a (very) large file
by Limbic~Region (Chancellor) on Nov 30, 2007 at 19:50 UTC
condar,
There have been some ideas already from the chatterbox. Here is an idea that I had while I was thinking about it:
#!/usr/bin/perl
use strict;
use warnings;

my %data;
my $file = $ARGV[0] or die "Usage: $0 <input_file>";
open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $!";

my $pos = tell $fh;
while (<$fh>) {
    my $size = (split /\t/)[2];
    $data{$size} = exists $data{$size}
        ? $data{$size} . ":$pos"
        : $pos;
    $pos = tell $fh;
}

for my $size (sort {$b <=> $a} keys %data) {
    for my $pos (split /:/, $data{$size}) {
        seek($fh, $pos, 0);
        my $line = <$fh>;
        print $line;
    }
}
Note: This code is untested and is not optimal. It was just an idea I had that you might want to try. It will output lines with the same size in the same order they appeared in the original file.
Update: Inlined get_size() sub, which was just a call to split. Also specified keys %data.
Re: Sorting a (very) large file
by sundialsvc4 (Abbot) on Nov 30, 2007 at 21:22 UTC
400MB is not a “very large” file. It is “not a very large file at all.”
What you need to do is use a disk-based sort. There are many of these on CPAN, such as File::Sort. Or, how about Sort::External? As it happens, “sorting” is one of the most extensively-studied and extensively-implemented algorithms in all of computer science. (The other is “searching,” saith the august Dr. Knuth.) My CPAN-search found 1,929 choices under “Sort.”
Although virtual-memory sorting is simple and easy to conceptualize, it is not intended for large volumes of data. You're seeing it choke, and fail, trying to do a task that is much larger than it was designed for.
Disk-oriented tools, of which there are a great many available, will automatically handle the job by splitting the total job into a group of disk-based spill files, then merging those files together for you. An arbitrary amount of data can be sorted in this way and it will be quite fast.
(Heck, you can even sort data using nothing more than magnetic tape! When all those campy old sci-fi movies needed to film “a computer,” the operators would obligingly start a tape-sort.)
Re: Sorting a (very) large file
by jrsimmon (Hermit) on Nov 30, 2007 at 19:42 UTC
The example you posted appears to be the output of a "dir" command from a dos-like system. Have you considered sorting the data as you capture it ("dir /OS" on winXP)?
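From Perl that might look something like this (the output file name is hypothetical):
# Windows only: dir /OS emits the listing already sorted by size.
system('dir /OS > listing.txt') == 0
    or die "dir failed: $?";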
Re: Sorting a (very) large file
by runrig (Abbot) on Nov 30, 2007 at 20:32 UTC
moritz already mentioned command line sort... here's how to do it (if you are on Windows, as jrsimmon believes, go get MSYS):
> # The file is tab delimited (your time field having spaces shouldn't matter)
> cat tmp.txt
3 1a 2 b
2 2c 3 b
1 3b 1 b
> # The char between the quotes is a tab ("\t" works in ksh)
> sort -t" " -k3n tmp.txt
1 3b 1 b
3 1a 2 b
2 2c 3 b
>
Update: I do not claim that the above quotes are properly escaped for Windows. Actually I'm not even sure how to escape them in the cmd shell, but this works from perl: system('sort', "-t", "\t", "-k3n", "tmp.txt");