in reply to join on 6 huge files

If you are certain that the SHA entries are going to be in the same order in all six of the files then reading from each file, one entry at a time, would probably be best.
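As a tiny sketch of the same-order case (file names and the "SHA then hex string" line format here are made up for illustration), `paste` reads one line from each file per output line, so corresponding entries line up with no sorting at all:

```shell
# Two small sample files in the same (hypothetical) SHA order.
printf 'sha1\taaa\nsha2\tbbb\n' > part1.txt
printf 'sha1\tccc\nsha2\tddd\n' > part2.txt

# paste consumes one line from each file per output line,
# so entry N of every file ends up on the same output line.
paste part1.txt part2.txt
```

The same idea works with six file handles in a script: read one entry from each handle per loop iteration.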

If the SHA entries in the files aren't in the same order, then I would try to cat the files together, sort them with the sort tool (which can probably sort 56 million lines much faster than perl), and then let a script read the sorted result.

Slurping shouldn't be necessary... just keep an array of hex strings for the current SHA and dump it to the output file when you hit a new SHA.
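That group-and-dump pass can be done in awk just as well as perl; a minimal sketch, assuming the sorted input has the SHA in field 1 and a hex string in field 2:

```shell
# Sample sorted input: SHA key, then a hex fragment.
printf 'sha1 aaa\nsha1 ccc\nsha2 bbb\n' > sorted.txt

# Collect the hex strings for the current SHA and emit one joined
# line whenever the key changes -- only one group is ever in memory.
awk '
    $1 != key { if (key != "") print key, vals; key = $1; vals = "" }
    { vals = (vals == "") ? $2 : vals " " $2 }
    END { if (key != "") print key, vals }
' sorted.txt > joined.txt

cat joined.txt
```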

This is just a SWAG, though, and I can't guarantee that sort won't get pathological on you with 56 million lines.

Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"

Replies are listed 'Best First'.
Re^2: join on 6 huge files
by ambrus (Abbot) on Jun 10, 2004 at 21:07 UTC

    You don't even need to re-sort the files if they are already sorted, as sort can merge sorted files:

    From (textutils.info)sort:

    `sort' has three modes of operation: sort (the default), merge, and check for sortedness. The following options change the operation mode:
    `-m'
    Merge the given files by sorting them as a group. Each input file must always be individually sorted. It always works to sort instead of merge; merging is provided because it is faster, in the case where it works.

    This means you won't have all six files open at once: you open two for reading, merge them into a temporary file, merge the third with the result etc.
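    A small sketch of one such pairwise merge step (file names are hypothetical); because `-m` only merges, each pass is a cheap linear scan rather than a full sort:

```shell
# Two already-sorted stand-in files.
printf 'sha1 aaa\nsha3 ccc\n' > a.txt
printf 'sha2 bbb\nsha4 ddd\n' > b.txt

# -m assumes each input is individually sorted and merges them.
sort -m a.txt b.txt > merged.txt

# A third sorted file would then be merged with the previous result:
#   sort -m merged.txt c.txt > merged2.txt

cat merged.txt
```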

Re^2: join on 6 huge files
by pbeckingham (Parson) on Jun 10, 2004 at 13:46 UTC

    ...I can't guarantee that sort won't get pathological on you with 56 million lines.
    But it may just squeak by with 54 million. Sorry, couldn't resist.