I need some advice and pointers. Let me first explain my situation. I have a huge (ca. 20 GB) text file that looks like this:
key1 key2 ndnjfgdsjfjjkjjfjf...
key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
key43 key21 sdkjfhdghdbgbd
key1 key3 jujdejnsduhffnjj
key2 key2 jhzezhdjjf...

I believe the structure is clear:
- there are two keys
- keys can be repeated

What I need to do is sort the file first by the key in the first column, so that all records with key1 come first, followed by key2, and so on up to key43. Next, the records inside each bucket need to be sorted again by the second key column. (There are only two key columns.) The fastest way I can imagine is to create 43 bucket files, iterate through the main file once, and print each record into its bucket. Once that is done, repeat the process inside each bucket. Afterwards, join the files and delete the buckets (files) that are no longer needed.
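To make the bucket plan concrete, here is a minimal Python sketch of it. It is only an illustration under some assumptions that are not in the post: keys look like `key<number>` and are safe to use in file names, fields are separated by whitespace, every record has both keys, and the number of distinct keys stays below the system's open-file limit. The names `key_number`, `partition` and `bucket_sort` are made up for this example.

```python
import os
import re
import shutil
import tempfile

def key_number(key):
    """Numeric suffix of a key such as b'key43' (assumed key format)."""
    return int(re.sub(rb"\D", b"", key))

def partition(src_path, column, out_dir):
    """Copy each record of src_path into a bucket file chosen by the key in
    the given column. Only one line is held in memory at a time.
    Returns {key: bucket_path}."""
    handles = {}
    try:
        with open(src_path, "rb") as src:
            for line in src:
                if not line.strip():
                    continue                  # skip blank lines
                if not line.endswith(b"\n"):
                    line += b"\n"             # last line may lack a newline
                key = line.split(None, 2)[column]
                if key not in handles:
                    name = f"col{column}_{key.decode('ascii')}.bucket"
                    handles[key] = open(os.path.join(out_dir, name), "wb")
                handles[key].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return {key: handle.name for key, handle in handles.items()}

def bucket_sort(src_path, dst_path, work_dir=None):
    """Bucket by the first key, then re-bucket each bucket by the second key,
    and concatenate everything in numeric key order into dst_path."""
    with tempfile.TemporaryDirectory(dir=work_dir) as tmp, \
            open(dst_path, "wb") as dst:
        first = partition(src_path, 0, tmp)
        for k1 in sorted(first, key=key_number):
            sub_dir = os.path.join(tmp, k1.decode("ascii"))
            os.mkdir(sub_dir)
            second = partition(first[k1], 1, sub_dir)
            os.remove(first[k1])              # first-level bucket is done
            for k2 in sorted(second, key=key_number):
                with open(second[k2], "rb") as part:
                    shutil.copyfileobj(part, dst)
            shutil.rmtree(sub_dir)            # free disk space early
```

Calling `bucket_sort("big.txt", "sorted.txt")` (hypothetical file names) keeps records with the same pair of keys in their original order. Memory use stays small because nothing is ever sorted in RAM, but the scratch directory temporarily needs about as much disk space as the input.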
The downside is that if sorting is interrupted, my temporary bucket files remain on disk and have to be removed by hand. Alternatively, I could intercept SIGINT and delete the buckets before the program terminates.
What I came here to ask is: does someone have a better solution, one that is faster, does not consume a lot of memory (100 MB at most), and does not create this file mess on my disk?
Any comment is more than welcome.
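The SIGINT idea could look roughly like the sketch below, again in Python and only as an illustration. It assumes all buckets live in one scratch directory, and it also catches SIGTERM, which the post does not mention. `run_with_cleanup` and `_exit_on_signal` are hypothetical names. Handlers can only be installed from the main thread, and SIGKILL cannot be caught at all, so a hard kill would still leave files behind.

```python
import shutil
import signal
import sys
import tempfile

def _exit_on_signal(signum, frame):
    # Raising SystemExit unwinds the stack, so `finally` blocks still run.
    sys.exit(128 + signum)

def run_with_cleanup(work):
    """Call work(bucket_dir) and delete bucket_dir afterwards, whether the
    work finishes, fails, or is interrupted by SIGINT/SIGTERM."""
    previous = {
        sig: signal.signal(sig, _exit_on_signal)
        for sig in (signal.SIGINT, signal.SIGTERM)
    }
    bucket_dir = tempfile.mkdtemp(prefix="buckets_")
    try:
        return work(bucket_dir)
    finally:
        shutil.rmtree(bucket_dir, ignore_errors=True)
        # Put the original handlers back (None means "not set from Python").
        for sig, handler in previous.items():
            signal.signal(sig, handler if handler is not None else signal.SIG_DFL)
```

The sorting routine would be passed in as `work` and would create all of its bucket files inside the directory it receives, so a single `rmtree` is enough to clean up.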
Thank you
baxy
UPDATE:
To keep the "file explosion" under control, would it be possible to create a "virtual file": one file divided into sections, and then print into the different sections (something like using fseek in C)? Would that be advisable?
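One way the "virtual file" idea might work is a two-pass scheme: a first pass adds up how many bytes each section needs, and a second pass seeks to each section's current write position and writes the record there. The Python sketch below uses one section per (key1, key2) pair, which is my own assumption and makes it essentially a counting sort. It also assumes keys of the form `key<number>`, whitespace-separated fields, and few enough distinct pairs to keep one counter each in memory. `records` and `section_sort` are hypothetical names.

```python
import re

def key_number(key):
    """Numeric suffix of a key such as b'key43' (assumed key format)."""
    return int(re.sub(rb"\D", b"", key))

def records(path):
    """Yield ((key1, key2), line) for every non-blank line of the file."""
    with open(path, "rb") as src:
        for line in src:
            if not line.strip():
                continue
            if not line.endswith(b"\n"):
                line += b"\n"                 # last line may lack a newline
            yield tuple(line.split(None, 2)[:2]), line

def section_sort(src_path, dst_path):
    # Pass 1: total number of bytes needed by each (key1, key2) section.
    sizes = {}
    for pair, line in records(src_path):
        sizes[pair] = sizes.get(pair, 0) + len(line)

    # Lay the sections out back to back in numeric key order.
    offsets, position = {}, 0
    for pair in sorted(sizes, key=lambda p: (key_number(p[0]), key_number(p[1]))):
        offsets[pair] = position
        position += sizes[pair]

    # Pass 2: write each record at its section's current write position.
    with open(dst_path, "wb") as dst:
        dst.truncate(position)                # pre-size the output file
        for pair, line in records(src_path):
            dst.seek(offsets[pair])
            dst.write(line)
            offsets[pair] += len(line)
```

This avoids temporary files completely, at the cost of reading the input twice and doing one seek per record. On a 20 GB file the random writes may well be slower than the bucket approach unless the writes are batched per section, so it would be worth timing on a sample first.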
In reply to sorting type question- space problems by baxy77bax