You didn't so much "say something wrong" as display a pattern of responses that, when allied with your Anonymous Monk status, led me to believe that any time expended attempting to help you would probably be wasted.

  1. First you ask a very generic question that can be answered by typing simple queries into CPAN: tie+hash, DB hash, file hash.

    And no description of the application.

  2. When prompted you provide some details.

    But these turn out to be completely unrepresentative of the actual data you are working with.

  3. Even the details you provided are reluctant and sketchy: "compare the sequence in multiple large files.... execute for any number of files however large ... more than 10GB data ...".

    Is that 10GB per file, or across the multiple files? If the latter, how many files is "many"?

  4. And finally, you seem to have unrealistic expectations, and to be more concerned with being provided with a totally generic solution than with solving your specific problem.

Your "sample data" showed 9 character keys drawn from a 5 character alphabet. Each of those 5 characters can be represented by a 3-bit pattern and 9x3=27 bits, thus the keys can be perfectly hashed -- uniquely identified -- by a single 32-bit number. With 5**9=1953125 unique keys, you could collate your entire datasets, regardless of the number and size of files, using a single, in-memory array of just under 2 million counts. A simple, very fast, single pass, and the task is complete in an hour or two.

But you eventually reveal that your keys are actually 150 characters: that would make for a 450-bit (57-byte) perfect hash, i.e. 5**150 ~= 7.0065e+104 possible keys. Let me try to make sense of that number for you: 700649232162408535461864791644958065640130970938257885878534141944895541342930300743319094181060791015625

If there are the estimated 1 billion trillion (10^21) stars in the universe, and each of them had 10 Earth-like planets, and each of those planets carried the 7.5 quintillion grains of sand estimated to exist on Earth, and every grain was converted into a 1GB memory chip, you would have roughly 7.5x10^49 bytes of storage. Even allowing just one byte per possible key, that is still only about one part in 10^55 of the memory required to apply the solution I was offering (based on the information provided to that point) to your actual dataset.
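If you want to check that arithmetic yourself, the core Math::BigInt module makes it a one-minute job (the star, planet and sand-grain figures are the rough estimates used above, nothing more):

    use strict;
    use warnings;
    use Math::BigInt;

    my $keys  = Math::BigInt->new( 5 )->bpow( 150 );          # possible 150-char keys
    my $bytes = Math::BigInt->new( 10 )->bpow( 21 )           # ~10^21 stars
              * 10                                            # 10 planets per star
              * Math::BigInt->new( '7500000000000000000' )    # 7.5e18 grains of sand each
              * 1_000_000_000;                                # one 1GB chip per grain

    my $short = $keys->copy->bdiv( $bytes );                  # quotient only (scalar assignment)

    print "possible keys   : $keys\n";
    print "bytes available : $bytes\n";
    print "shortfall factor: $short\n";                       # ~9.3e+54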

So you see, there is little point in pursuing the solution I was offering, given the nature of your actual data.

What else can you do?

If this were a one-off process, you could try the methods identified by others in this thread.

How would I attempt the task as now described?

Finally, on the basis of the information you have provided so far (assuming there are no further twists and turns), the method I would probably use for your task is an extension of the one I was offering above.

You make one pass over your files, generating a perfect hash number from the first 10 characters of each DNA key, using the same 3-bits-per-character method (10x3=30 bits, which still fits into a 32-bit integer), and against each of these perfect hash values you store the fileID and byte offset of the record. This effectively insertion-sorts your dataset into 5**10 = 9,765,625 subsets that share the same first 10 characters.
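A sketch of that first pass (untested; it again assumes the A/C/G/T/N alphabet and base-5 packing from above, and pretends each record is a single line starting with the key, so the parsing of your real multi-line records would need adjusting):

    use strict;
    use warnings;

    my %code = ( A => 0, C => 1, G => 2, T => 3, N => 4 );   # assumed alphabet
    my @subsets;                                             # index => [ [ fileID, offset ], ... ]

    for my $fileID ( 0 .. $#ARGV ) {
        open my $fh, '<', $ARGV[ $fileID ] or die "$ARGV[ $fileID ]: $!";
        while( 1 ) {
            my $offset = tell $fh;                           # byte offset of this record
            my $line   = <$fh>;
            last unless defined $line;

            my $h = 0;
            $h = $h * 5 + $code{ $_ } for split //, substr $line, 0, 10;   # 0 .. 5**10-1
            push @{ $subsets[ $h ] }, [ $fileID, $offset ];
        }
        close $fh;
    }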

Assuming 10GB represents the combined size of your files, and guesstimating that your individual multi-line records average around 300 characters, that gives an estimate of roughly 36 million records in total. Splitting those across just under 10 million subsets gives an average of fewer than 4 records per subset.

Having performed one fast pass over the dataset to build the subsets, a second pass over the subsets, fetching each of the keys within a subset by direct random access for full comparison, will again be relatively fast, as it requires only reads.
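And a matching sketch of the second pass, continuing from the @subsets built above (untested; "full comparison" here means grouping records whose entire 150-character keys are identical, which may or may not be exactly what you are after):

    # Re-open the files for random access.
    my @fh = map { open my $fh, '<', $_ or die "$_: $!"; $fh } @ARGV;

    for my $subset ( grep { $_ and @$_ > 1 } @subsets ) {    # only subsets with 2+ records
        my %seen;
        for my $rec ( @$subset ) {
            my( $fileID, $offset ) = @$rec;
            seek $fh[ $fileID ], $offset, 0 or die "seek: $!";
            my $line = readline $fh[ $fileID ];
            my $key  = substr $line, 0, 150;                 # the full key
            push @{ $seen{ $key } }, "$ARGV[ $fileID ]:$offset";
        }
        for my $key ( grep { @{ $seen{ $_ } } > 1 } keys %seen ) {
            print "duplicate key $key seen at @{ $seen{ $key } }\n";
        }
    }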

The whole process should take less than a couple of hours, which is probably less time than it would take to load the dataset into a DB, and certainly less time than sorting it completely using your system's sort utility.

