in reply to Re^3: read and sort multiple files
in thread read and sort multiple files

This maximizes memory usage while limiting memory usage.

Sorry, I don't understand the above comment at all. Would you mind explaining that?

2. for f in tmp/chunk* ; do sort $f >$f.sorted ; done

The sorting would actually be done before writing out the chunk.

This cannot be guaranteed; what if the file is too big?

Re^5: read and sort multiple files
by ikegami (Patriarch) on Dec 02, 2008 at 15:02 UTC

    Sorry, I don't understand the above comment at all. Would you mind explaining that?

    Let's say you have 100 files to process:

    1. 2,250,000 lines
    2. 500,000 lines
    3. And the rest are all 10,000 lines long

    Let's say you have enough memory to handle 1,000,000 lines. Let's ignore byte size for simplicity, but you'll notice I placed a limit on that as well in the split example.

    You have two problems:

    1. You have many files to merge.
    2. One of your files is too big.

    The first problem is easy to fix: Just concatenate all the files. Then you're left with one file that's too big.

    1. 3,730,000 lines

    The second problem is easy to fix as well: Just split the file into smaller pieces, keeping those pieces as big as possible.

    You end up with:

    1. 1,000,000 lines
    2. 1,000,000 lines
    3. 1,000,000 lines
    4. 730,000 lines

    If all you had done was split the large file, you would have 102 files to sort and merge (the 99 untouched files plus the 3 pieces of the big one), resulting in 101 pairwise merges. By concatenating first, all you have to do is 3 (long) merges. It cuts down on overhead a little.
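
    Something like this untested sketch shows the shape of it (this isn't the split example from the earlier node; the chunk size, the temp file names, and sorting each chunk in memory before writing it out are just illustrative assumptions):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $max_lines = 1_000_000;   # assumed per-chunk line limit
        my $chunk_no  = 0;
        my @buffer;

        mkdir 'tmp' unless -d 'tmp';

        # Write the buffered lines out as one sorted chunk.
        sub flush_chunk {
            return unless @buffer;
            my $name = sprintf 'tmp/chunk%04d.sorted', $chunk_no++;
            open my $out, '>', $name or die "Can't write $name: $!";
            print {$out} sort @buffer;
            close $out or die "Can't close $name: $!";
            @buffer = ();
        }

        # Conceptually "cat file1 file2 ... | split": chunk boundaries may
        # fall anywhere, including in the middle of an input file.
        for my $file (@ARGV) {
            open my $in, '<', $file or die "Can't read $file: $!";
            while (my $line = <$in>) {
                push @buffer, $line;
                flush_chunk() if @buffer >= $max_lines;
            }
            close $in;
        }
        flush_chunk();   # final, possibly short chunk

    The sorted chunks would then be merged pairwise (or n-way) as in the original steps.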

Re^5: read and sort multiple files
by ikegami (Patriarch) on Dec 02, 2008 at 15:03 UTC

    This cannot be guaranteed; what if the file is too big?

    Can't be. It sorts the split chunks, not the original files.

      Yes, that's right.

      Sorting the split chunks isn't free either, as there are an additional 'n-1' opens and closes involved with the extra 'n-1' files when working with split chunks of a file.

        What extra opens and closes? Grouping files greatly cuts down on the number of opens and closes.

        • In the case of a few small files: say 10 files, when 100 such files would fit in memory at a time.

          • Without grouping:
            To sort: 2*10 opens
            To merge: 3*(10-1) opens
            Total: 47

          • With grouping:
            To sort: 10 + ceil(10/100) opens
            To merge: 3*(ceil(10/100)-1) opens
            Total: 11

        • In the case of lots of small files, say 100 files, 10 of which fit in memory at a time.

          • Without grouping:
            To sort: 2*100 opens
            To merge: 3*(100-1) opens
            Total: 497

          • With grouping:
            To sort: 100 + ceil(100/10) opens
            To merge: 3*(ceil(100/10)-1) opens
            Total: 137

        • In the case of big files, it simply won't work without splitting.

        You could cut down on the number of opens and closes by merging more than two files at a time, but it's not relevant here because it would apply to both grouping and non-grouping algorithms.
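
        The totals above can be sanity-checked with a few lines of Perl; the formulas simply restate the counts in the bullets (two-way merges, one open per file read or written):

            use strict;
            use warnings;
            use POSIX 'ceil';

            # [number of files, how many of them fit in memory at a time]
            for my $case ([10, 100], [100, 10]) {
                my ($files, $fit) = @$case;
                my $chunks    = ceil($files / $fit);
                my $ungrouped = 2 * $files + 3 * ($files - 1);
                my $grouped   = $files + $chunks + 3 * ($chunks - 1);
                print "$files files: ungrouped=$ungrouped, grouped=$grouped\n";
            }

        This prints 47 vs 11 and 497 vs 137, matching the counts above.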