in reply to Re^4: storing hash in temporary files to save memory usage
in thread storing hash in temporary files to save memory usage
You didn't so much "say something wrong" as display a pattern of responses that, when allied with your anonymonk status, led me to believe that any time expended attempting to help you would probably be wasted.
And no description of the application.
But these turn out to be completely unrepresentative of the actual data you are working with.
Is that 10GB per file, or across the multiple files? If the latter, how many files is "many"?
Your "sample data" showed 9 character keys drawn from a 5 character alphabet. Each of those 5 characters can be represented by a 3-bit pattern and 9x3=27 bits, thus the keys can be perfectly hashed -- uniquely identified -- by a single 32-bit number. With 5**9=1953125 unique keys, you could collate your entire datasets, regardless of the number and size of files, using a single, in-memory array of just under 2 million counts. A simple, very fast, single pass, and the task is complete in an hour or two.
But you eventually identified that your keys are actually 150 characters: that would make for a 450-bit, or 57-byte, perfect hash. I.e. 7.0064923216240853546186479164496e+104 possible keys. Let me try and make sense of that number for you: 700649232162408535461864791644958065640130970938257885878534141944895541342930300743319094181060791015625
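(That 105-digit figure is just 5**150, roughly 7.0e104; easy enough to confirm, for instance with Math::BigInt:)

```perl
use strict;
use warnings;
use Math::BigInt;

my $keys = Math::BigInt->new( 5 )->bpow( 150 );   # 5 possibilities in each of 150 positions
print "$keys\n";                                  # 70064923216240853546...0791015625
print length( "$keys" ), " digits\n";             # 105 digits
```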
If there are the estimated 1 billion trillion stars in the universe, and each of them had 10 earth-like planets around it, and each of those planets carried the 7.5 quintillion grains of sand estimated to exist on earth, and every grain were converted into a 1GB memory chip, you would still have only about one part in 10**55 of the memory required -- even at a single byte per count -- to apply the solution I was offering, based on the information provided to that point, to your actual dataset.
So you see, there is little point in pursuing the solution I was offering, given the nature of your actual data.
If this was a one-off process, you could try the methods identified by others in this thread.
Of course, if each of your "multiple files" is 10GB, and you need to sort them all ....
And if, as implied, you will need to be doing this process over and over again on many equally huge datasets, that sorting will likely become a bottleneck in your research.
But my experience of using that tool -- remarkable as it is on small datasets -- on similar datasets of just 1GB has proved to my satisfaction that it isn't really up to the task.
And remember, it doesn't play well with concurrent access; so the simple expedient of using multi-tasking -- via threads or processes -- to cut your runtimes to a quarter, an eighth or a sixteenth is taken right off the table.
But the fundamental underlying problem of providing efficient access to such a potentially vast range of keys still remains. Indexing gains you little when the index field is itself 150 characters. Postgres, like the other serious RDBMSs, has some specialist proprietary methods for dealing with such data, but you will have to research which of them to use and how to make best use of them.
And in the end, the process of loading up and indexing/hashing large datasets in order to perform one-off collating tasks like the one you've described often takes longer, on its own, than performing that one-off task using non-RDBMS methods.
Even the process of describing the -- essentially simple -- counting task in SQL terms of joins across multiple tables, and then tweaking and re-tweaking the schema and query to encourage the RDBMS to perform that process efficiently, becomes quite complicated and requires intimate knowledge of the particular RDBMS and its proprietary extensions; which, for a one-off process, becomes prohibitive.
And repeating it for different datasets -- with different key sizes and alphabets, different numbers of tables, perhaps variations in original data formats -- only to discard everything once the task is done ....
Finally, on the basis of the information you have provided -- so far, assuming there are no further twists and turns -- the method I would probably use for your task is an extension of the one I was offering above.
You do one pass over your files, generating a perfect-hash number from the first 10 characters of your DNA using the 3-bits-per-char method (10x3=30 bits, so it still fits into a 32-bit integer), and against these perfect-hash values you store the fileID and byte offset of each record. This effectively bucket-sorts your dataset into 5**10 = 9,765,625 subsets that share the same first 10 chars (sketched below).
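A sketch of that first pass, under the same assumptions as the earlier snippet (A/C/G/T/N alphabet, base-5 packing rather than 3-bit shifts so the hash is a dense array index, and -- purely for illustration -- one record per line with the key at the start; adapt the record splitting to your real multi-line format):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %code = ( A => 0, C => 1, G => 2, T => 3, N => 4 );   # assumed alphabet

sub prefixToIndex {                 # perfect hash of the first 10 chars: 0 .. 9,765,624
    my( $prefix ) = @_;
    my $n = 0;
    $n = $n * 5 + $code{ $_ } for split //, $prefix;
    return $n;
}

# Pass 1: note where every record lives, bucketed by the first 10 chars of its key.
my @subsets;                        # $subsets[ $i ] = [ [ fileID, offset ], ... ]

for my $fileID ( 0 .. $#ARGV ) {
    open my $in, '<', $ARGV[ $fileID ] or die "$ARGV[ $fileID ]: $!";
    while( 1 ) {
        my $offset = tell $in;
        defined( my $line = <$in> ) or last;
        push @{ $subsets[ prefixToIndex( substr $line, 0, 10 ) ] }, [ $fileID, $offset ];
    }
    close $in;
}
```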
Assuming 10GB represents the combined size of your files, and guesstimating that your individual multi-line records are around 300 chars, that gives an estimate of 36 million total records. Splitting that across just under 10 million subsets gives an average of just 4 records per subset.
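(Back-of-envelope check of those estimates, assuming 10 GiB combined and ~300-byte records:)

```perl
use strict;
use warnings;

my $bytes   = 10 * 2**30;                  # 10 GiB combined
my $records = $bytes / 300;                # ~35.8 million records
my $subsets = 5**10;                       # 9,765,625 subsets
printf "%.1f million records, ~%.1f records per subset\n",
    $records / 1e6, $records / $subsets;   # 35.8 million records, ~3.7 records per subset
```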
Having performed one fast pass over the dataset to build the subsets, a second pass over the subsets -- fetching each of the keys within a subset, by direct random access, for full comparison -- will again be a relatively fast process, as it requires only reads.
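Continuing directly from the pass-1 sketch (it reuses @subsets and @ARGV from there), that second pass might look like this; only the handful of records in any one subset is ever held in memory at once:

```perl
# Pass 2: for each non-empty subset, fetch the full 150-char keys by direct
# random access and tally exact matches within that subset.
my @in = map { open my $fh, '<', $_ or die "$_: $!"; $fh } @ARGV;

for my $subset ( grep { defined } @subsets ) {
    my %count;
    for my $rec ( @$subset ) {
        my( $fileID, $offset ) = @$rec;
        seek $in[ $fileID ], $offset, 0 or die "seek: $!";
        my $line = readline $in[ $fileID ];
        ++$count{ substr $line, 0, 150 };       # full-key comparison via exact tally
    }
    print "$_ : $count{ $_ }\n" for sort keys %count;
}
```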
The whole process should take less than a couple of hours, which is probably less time than it would take just to load the dataset into a DB; and certainly less than sorting it completely with your system's sort utility.