So you have 20M records, and for each of them you have generated a SHA-1 signature. First, that is "not ok" as an ID from the get-go, because a SHA-1 signature is not guaranteed to be unique: it maps input of any length onto 160 bits, so two different records can end up with the same signature. The idea of compressing a non-unique set of bits into a smaller set of bits that is unique just doesn't make sense!

You haven't explained how big this DB is. I guess it is possible, although VERY unlikely, that this DB is small enough to be memory resident.

If we think about storing just the 20M SHA-1 signatures, each is 20 bytes (160 bits). For the hardware, powers of 2 are magic, and it goes: 2, 4, 8, 16, 32. In a practical sense, each signature will take 32 bytes: 8 32-bit (4-byte) words or 16 16-bit (2-byte) words. That is a fair amount of memory for 20M records (about 640 MB), and these "keys" (they are SHA-1 signatures) aren't even unique! I don't know what your plan is to deal with that. And of course, besides the memory to store the SHA-1 signatures, there has to be some data that points to something (on disk or wherever). That will take some bytes too!
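The arithmetic above can be checked with a few lines of Perl. This is only a sketch: the helper names `slot_size` and `total_mb` are made up for illustration, and rounding each signature up to the next power of 2 is the assumption made in this post, not something every storage layout is forced to do.

```perl
use strict;
use warnings;

# Hypothetical helper: smallest power of 2 that holds $bytes bytes.
sub slot_size {
    my ($bytes) = @_;
    my $slot = 1;
    $slot *= 2 while $slot < $bytes;
    return $slot;
}

# Hypothetical helper: total size in MB (10**6 bytes) for $count items.
sub total_mb {
    my ($count, $bytes_each) = @_;
    return $count * $bytes_each / 1_000_000;
}

my $records    = 20_000_000;
my $sha1_bytes = 20;                      # a SHA-1 digest is 160 bits
my $slot       = slot_size($sha1_bytes);  # 32

printf "slot size: %d bytes\n", $slot;
printf "packed:    %d MB\n", total_mb($records, $sha1_bytes);
printf "padded:    %d MB\n", total_mb($records, $slot);
```

With 20-byte digests this prints a slot size of 32 bytes, 400 MB if the signatures were packed tightly, and 640 MB with the power-of-2 padding. Neither figure includes the data that each key has to point to.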

You need a database. A database driven through Perl DBI, in any of its many flavors, can easily handle 20M records. Forget SHA-1 or SHA-2 as the record ID; that makes no sense. Let the DB use its own hash algorithm.
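As a rough sketch of what that could look like, here is a minimal DBI example. It assumes DBD::SQLite is installed and uses an in-memory database; the table name `records` and the column names are made up for illustration, and the original poster's schema is unknown. The point is that the database hands out a unique integer key and maintains the index itself.

```perl
use strict;
use warnings;
use DBI;

# Assumption: the DBD::SQLite driver is available. Other DBD drivers
# use the same DBI calls with a different connection string.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
    { RaiseError => 1, AutoCommit => 1 });

# The database assigns a unique integer id to every row.
$dbh->do("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)");

# Insert inside one transaction; this matters a lot for bulk loads.
my $insert = $dbh->prepare("INSERT INTO records (payload) VALUES (?)");
$dbh->begin_work;
$insert->execute("record number $_") for 1 .. 1000;
$dbh->commit;

# Look a record up by the key the database generated.
my ($id, $payload) = $dbh->selectrow_array(
    "SELECT id, payload FROM records WHERE id = ?", undef, 42);
print "$id: $payload\n";

$dbh->disconnect;
```

For a real 20M-record load you would point the connection string at a file or a server instead of `:memory:` and commit in batches rather than in one giant transaction.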


In reply to Re: Question: methods to transfer a long hexadicimal into shorter string by Marshall
in thread Question: methods to transfer a long hexadicimal into shorter string by lihao
