in reply to sorting very large text files

As salva just said, I'd be tempted to change the way you access the data. Naively (because this is not my bread and butter), I'd either create an index file of field -> file+line number and sort on that, or pre-sort the records into smaller bucket files so that there's less work to do per sort.
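A minimal sketch of the first idea, under assumptions of my own: I store byte offsets rather than line numbers so the index can be used with seek later, I take the first whitespace-separated field as the sort key, and the file names are made up:

    use strict;
    use warnings;

    # One sequential scan: record each line's sort key and byte offset.
    my @index;
    open my $in, '<', 'big.txt' or die "big.txt: $!";
    while (1) {
        my $off  = tell $in;           # offset of the line we're about to read
        my $line = <$in>;
        last unless defined $line;
        my ($key) = split ' ', $line;  # assume the first field is the sort key
        $key = '' unless defined $key; # guard against blank lines
        push @index, "$key\t$off";
    }
    close $in;

    # The index is small enough to sort in memory even when the data isn't.
    open my $idx, '>', 'big.idx' or die "big.idx: $!";
    print {$idx} "$_\n" for sort @index;
    close $idx;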

Not that these are particularly brilliant ideas; I'm just trying to illustrate that there's more than one way to crack this nut, and that it's worth looking in different directions.

best of luck,

perl -e 'print qq(Just another Perl Hacker\n)' # where's the irony switch?

Re^2: sorting very large text files
by salva (Canon) on Dec 21, 2009 at 11:36 UTC
    I'd either create an index file of field -> file+line number and sort on that

    Unfortunately, it is not as easy as that. If you sort an index containing just the sort keys and the offsets, you still need a final step that combines the sorted index with the original file to produce the fully sorted output file.

    Doing this step the straightforward way, walking the sorted index and seeking into the original file to read every line, would be very, very, very inefficient. Roughly (as estimated by BrowserUk): 165e6 records * 10ms per seek ≈ 19 days!!!
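    For concreteness, the step being warned against would look roughly like this (index format and file names are assumptions carried over from the sketch above; the cost is one random seek per output line):

        # Anti-pattern: one disk seek per record while writing the output.
        open my $idx, '<', 'big.idx'    or die "big.idx: $!";
        open my $in,  '<', 'big.txt'    or die "big.txt: $!";
        open my $out, '>', 'sorted.txt' or die "sorted.txt: $!";
        while (my $entry = <$idx>) {
            my (undef, $off) = split /\t/, $entry;  # key \t byte offset
            seek $in, $off, 0;                      # ~10 ms on a spinning disk
            print {$out} scalar <$in>;
        }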

    A workaround is to create the final file in several passes, reading the original file sequentially each time and generating one slice of the output per pass... not so easy!
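    A minimal sketch of that multi-pass idea, again under my own assumptions (first field as the sort key, four passes, made-up file names): each pass rescans the input sequentially and keeps only the lines whose sorted position falls in the current slice.

        use strict;
        use warnings;

        # Build the key/offset index with one sequential scan.
        my @index;
        open my $in, '<', 'big.txt' or die "big.txt: $!";
        while (1) {
            my $off  = tell $in;
            my $line = <$in>;
            last unless defined $line;
            my ($key) = split ' ', $line;
            $key = '' unless defined $key;
            push @index, [ $key, $off ];
        }
        close $in;
        @index = sort { $a->[0] cmp $b->[0] } @index;

        # Emit the output in a few sequential passes instead of 165e6 seeks.
        my $passes = 4;                        # tune to available memory
        my $per    = int( @index / $passes ) + 1;

        open my $out, '>', 'sorted.txt' or die "sorted.txt: $!";
        for my $p ( 0 .. $passes - 1 ) {
            my $lo = $p * $per;
            my $hi = $lo + $per - 1;
            $hi = $#index if $hi > $#index;
            next if $lo > $hi;

            my %rank;                          # offset -> position in this slice
            $rank{ $index[$_][1] } = $_ - $lo for $lo .. $hi;

            open my $scan, '<', 'big.txt' or die "big.txt: $!";
            my @slice;
            while (1) {
                my $off  = tell $scan;
                my $line = <$scan>;
                last unless defined $line;
                $slice[ $rank{$off} ] = $line if exists $rank{$off};
            }
            close $scan;
            print {$out} @slice;               # this slice is now in key order
        }
        close $out;

    Each pass buffers at most one slice of the file in memory, trading the millions of random seeks for a handful of full sequential reads.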