in reply to Fast/Efficient Sort for Large Files

I have done this type of thing a lot on Solaris boxes. I need a few more bits of data from you to give you a fast working solution:

1, What type of sort are you using? Are you sorting on multiple "keys"? If so, are all the sorts in the same direction (all ascending, or mix and match)? Give me an example from your posted dataset.

2, I need to know what your system usage looks like. Can you dump the second display of mpstat 1 2 while your Perl version is running?

Let me know and I will be able to help.

Edited: I missed some info in your post. Here is what I got running a test on my devel Sun 420R box (2 procs, 2 GB RAM): 24 million records in 20.25 minutes. Here is how I did it:
/usr/bin/sort -k1 -k2 -T /data2 -o t.sorted t
This says: sort on the first field, then if there are duplicates sort on the second; use /data2 as the temp file area; and write the output to t.sorted instead of stdout. You will need 4x <size of inputfile> free space in the temp working dir. Also, your sort rate may be better than mine, because I had a lot of duplicate records, which forced the secondary key sort to happen far more often. If you need to sort on the 40000600045 field first and then break ties with the first field, use -k2 -k1. Let me know if you have questions.
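
A minimal sketch of the same kind of invocation on a tiny sample, assuming whitespace-separated fields. The file names (sample.txt, sample.sorted), the scratch directory, and the data are all invented for illustration. One caveat: in POSIX sort, a key written as -k1 runs from field 1 to the end of the line, so this sketch uses -k1,1 -k2,2 to confine each key to a single field. That is a variation on the command above, not the exact command that was timed.

```shell
# Scratch area for the demo (hypothetical paths, not from the original post).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir tmp

# Invented sample data: two whitespace-separated fields per record.
cat > sample.txt <<'EOF'
b 2
a 3
b 1
a 1
EOF

# -k1,1  primary key: field 1 only
# -k2,2  tie-breaker: field 2 only
# -T     directory for sort's temporary files
# -o     write to a file instead of stdout
# LC_ALL=C forces plain byte ordering so the result does not depend on locale.
LC_ALL=C sort -k1,1 -k2,2 -T tmp -o sample.sorted sample.txt

cat sample.sorted
```

Swapping the keys to -k2,2 -k1,1 would make the second field the primary key and the first field the tie-breaker.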

-Waswas

Re: Re: Fast/Efficient Sort for Large Files
by waswas-fng (Curate) on Dec 19, 2002 at 17:58 UTC
    Just to add a few more points: the sort command above automatically does a split, sort, merge sort. It is still single threaded, but I don't think you will get much faster by forking or threading it in Perl. Also, unlike the posts above, I would say do _NOT_ use the memory limiter flag on the sort command, as it makes sort much slower. Solaris's sort is smart enough to use as much memory as it can to make the sort faster, and it does not try to do _everything_ in memory unless it can. Overall, Solaris sort has proved to be very fast compared to Perl's sort, except for the occasions where you must do more than 5 or 6 compound compares per row or do inline mods to the data.
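
To illustrate what "split, sort, merge" means, the same idea can be done by hand with standard tools: cut the input into chunks, sort each chunk, then merge the sorted chunks with sort -m. This is only a sketch of the general technique on invented data, not a description of how Solaris sort works internally. The file names (big.txt, chunk.*, big.sorted) and the 2-line chunk size are hypothetical.

```shell
# Scratch area for the demo (hypothetical paths).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Invented sample data: two whitespace-separated fields per record.
cat > big.txt <<'EOF'
d 1
a 2
c 3
b 4
e 5
a 1
EOF

# Split: cut the input into 2-line chunks named chunk.aa, chunk.ab, ...
split -l 2 big.txt chunk.

# Sort: sort each chunk on its own, using the same keys as the final merge.
for f in chunk.??; do
    LC_ALL=C sort -k1,1 -k2,2 -o "$f.sorted" "$f"
done

# Merge: -m merges inputs that are already sorted, without re-sorting them.
LC_ALL=C sort -m -k1,1 -k2,2 -o big.sorted chunk.??.sorted

cat big.sorted
```

The merged result should match what a single sort over the whole file produces; the point of the post above is that sort already does this kind of chunking for you, so doing it by hand is rarely worth it.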

    -Waswas