in reply to Re^3: storing hash in temporary files to save memory usage
in thread storing hash in temporary files to save memory usage
I am really very sorry if I said something wrong. I just mentioned that the length should not be fixed because you asked me about the length of the dna sequences. My problem is to find a way to reduce memory consumption so that my code works with multiple large files and does not get killed. Any help will be appreciated.
Re^5: storing hash in temporary files to save memory usage
by BrowserUk (Patriarch) on Sep 22, 2016 at 08:53 UTC
You didn't so much "say something wrong" as display a pattern of responses that, when allied with your anonymonk status, led me to believe that any time expended attempting to help you would probably be wasted.

Your "sample data" showed 9-character keys drawn from a 5-character alphabet. Each of those 5 characters can be represented by a 3-bit pattern, and 9x3=27 bits, thus the keys can be perfectly hashed -- uniquely identified -- by a single 32-bit number. With 5**9=1953125 unique keys, you could collate your entire datasets, regardless of the number and size of files, using a single, in-memory array of just under 2 million counts. A simple, very fast, single pass, and the task is complete in an hour or two.

But you eventually identify that your keys are actually 150 characters: that would make for a 450-bit, or 57-byte, perfect hash. I.e. 5**150, or roughly 7.0065e+104, possible keys. Let me try and make sense of that number for you:

700649232162408535461864791644958065640130970938257885878534141944895541342930300743319094181060791015625

If there were the estimated 1 billion trillion stars in the universe, and each of them had 10 earth-like planets around them, and each of those had the 7.5 quintillion grains of sand that it is estimated there are on earth, and each grain were converted into a memory chip of 1GB, there would still be less than one part in 1e+54 of the amount of memory required to apply the solution I was offering -- based on the information provided to that point -- to your actual dataset.

So you see, there is little point in pursuing the solution I was offering, given the nature of your actual data.

What else can you do?

If this was a one-off process, you could try the methods identified by others in this thread.

How would I attempt the task as now described?

Finally, on the basis of the information you have provided -- so far, assuming there are no further twists and turns -- the method I would probably use for your task is an extension of the one I was offering above.

You do one pass over your files, generate a perfect hash number from the first 10 characters of your dna, using the 3-bits-per-char method (10x3=30, still fits into a 32-bit integer), and against these perfect hash values you store the fileID and byte offset of the record. This effectively bucket-sorts your dataset into 5**10=9,765,625 subsets that share the same first 10 chars.

Assuming 10GB represents the combined size of your files, and guestimating that your individual multi-line records are 300 chars, that gives an estimate of 36 million total records. Split that into just under 10 million subsets and you get an average of just 4 records per subset.

Having performed one fast pass over the dataset to find the subsets, a second pass over the subsets -- fetching, by direct random access, each of the keys within a subset for full comparison -- will again be a relatively fast process, as it requires only reads.

The whole process should take less than a couple of hours, which is probably less time than it would take to load the dataset into a DB; and certainly less time than sorting it completely using your system's sort utility.
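In outline, the counting scheme amounts to something like the following minimal sketch (not code from this thread; the one-key-per-line input layout is an assumption for illustration). It uses a base-5 index, which is the same perfect-hash idea as the 3-bit packing but keeps the array at exactly 5**9 elements rather than 2**27:

```perl
#!/usr/bin/perl
# Sketch only: count occurrences of 9-character keys drawn from the
# alphabet A, C, G, T, N using a perfect hash into a flat array of
# 5**9 == 1_953_125 counters.  Input layout (one key per line) is an
# illustrative assumption.
use strict;
use warnings;

my %code  = ( A => 0, C => 1, G => 2, T => 3, N => 4 );
my @alpha = qw( A C G T N );

my @counts = (0) x 5**9;    # one slot per possible key; just under 2 million

for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open '$file': $!";
    while ( my $key = <$fh> ) {
        chomp $key;
        next unless $key =~ m/\A[ACGTN]{9}\z/;

        # Treat the key as a 9-digit base-5 number: a unique index 0 .. 5**9-1.
        my $idx = 0;
        $idx = $idx * 5 + $code{$_} for split //, $key;

        ++$counts[$idx];
    }
    close $fh;
}

# Decode each non-zero index back into its key and report the count.
for my $idx ( 0 .. $#counts ) {
    next unless $counts[$idx];
    my ( $key, $n ) = ( '', $idx );
    for ( 1 .. 9 ) {
        $key = $alpha[ $n % 5 ] . $key;
        $n   = int $n / 5;
    }
    print "$key\t$counts[$idx]\n";
}
```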
Re^5: storing hash in temporary files to save memory usage
by BrowserUk (Patriarch) on Sep 24, 2016 at 16:09 UTC
For my own satisfaction, I generated just over 10GB of data, spread across 4 files, that vaguely approximated your described data.
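The generator and a sample of its output aren't shown here; as a rough sketch, data of that shape might be produced along these lines (the two-line record layout, the dna*.dat file names and the ~2.5GB-per-file target are assumptions for illustration, not the files actually generated):

```perl
#!/usr/bin/perl
# Sketch only: write ~10GB of fake DNA records across 4 files.  Each
# record is assumed to be two lines: an id line and a 150-base sequence.
use strict;
use warnings;

my @bases          = qw( A C G T N );
my $files          = 4;
my $bytes_per_file = 2.5 * 1024**3;    # ~2.5GB each, ~10GB in total
my $id             = 0;

for my $n ( 1 .. $files ) {
    open my $out, '>', "dna$n.dat" or die "dna$n.dat: $!";
    my $written = 0;
    while ( $written < $bytes_per_file ) {
        my $seq = join '', map { $bases[ rand @bases ] } 1 .. 150;
        my $rec = sprintf ">read%09d\n%s\n", ++$id, $seq;
        print {$out} $rec;
        $written += length $rec;
    }
    close $out;
}
```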
I implemented the code I described under "How would I attempt the task as now described?" in 55 lines, and ran it.
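That 55-line program isn't reproduced here, but a minimal sketch of the two-pass scheme it implements might look like the following (assuming the two-line record layout used in the generator sketch above, and assuming the end goal is reporting duplicated sequences -- both assumptions for illustration):

```perl
#!/usr/bin/perl
# Sketch only of the two-pass scheme: pass 1 bins every record by a
# perfect hash of the first 10 bases, recording just (file number, byte
# offset); pass 2 revisits each bin by direct seek and compares the full
# sequences.  Assumes two-line records: ">id\n" then a 150-base line.
use strict;
use warnings;

my %code  = ( A => 0, C => 1, G => 2, T => 3, N => 4 );
my @files = @ARGV;

# Pass 1: one sequential read of everything.  Each bin is a packed string
# of (32-bit file number, 64-bit offset) pairs, to keep memory modest.
my @bins;
for my $fi ( 0 .. $#files ) {
    open my $fh, '<', $files[$fi] or die "$files[$fi]: $!";
    until ( eof $fh ) {
        my $offset = tell $fh;
        my $id     = <$fh>;                 # header line
        my $seq    = <$fh>;                 # sequence line
        last unless defined $seq;
        chomp $seq;

        my $prefix = substr $seq, 0, 10;
        next if $prefix =~ tr/ACGTN//c;     # skip anything malformed

        my $idx = 0;                        # perfect hash of the 10-char prefix
        $idx = $idx * 5 + $code{$_} for split //, $prefix;

        $bins[$idx] .= pack 'NQ', $fi, $offset;   # needs a 64-bit perl for 'Q'
    }
    close $fh;
}

# Pass 2: revisit each subset (average ~4 records) by random access only.
my @fh;
for my $fi ( 0 .. $#files ) {
    open $fh[$fi], '<', $files[$fi] or die "$files[$fi]: $!";
}

for my $bin (@bins) {
    next unless defined $bin and length($bin) > 12;   # need at least 2 records
    my %seen;                               # only one small subset in memory
    my @loc = unpack '(NQ)*', $bin;
    while (@loc) {
        my ( $fi, $offset ) = splice @loc, 0, 2;
        seek $fh[$fi], $offset, 0;
        my $id  = readline $fh[$fi];
        my $seq = readline $fh[$fi];
        chomp $seq;
        ++$seen{$seq};
    }
    print "$seen{$_}\t$_\n" for grep { $seen{$_} > 1 } keys %seen;
}
```

The point of packing the (file number, offset) pairs is that pass 1 holds roughly 12 bytes per record in memory rather than the sequences themselves, and pass 2 only ever has one small subset in a hash at a time.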
The run took 1 hour 20 minutes, so my "couple of hours" guestimate wasn't too far wrong.