I missed the bit that said that all strings are the same length... so I started worrying that this wouldn't work with variable length strings, because a 2 bit code treats 'CG' as 'CGAAAAAA..A' or 'A..AAACG' (assuming 'A' is encoded as 0).

FWIW: if the requirements were to change:

My first thought was to encode 4 bits of length, followed by 2 bit symbols. In 32 bits you can then encode 14 character symbols, or in 36 bits, 16. [I rejected adding a fifth symbol (end). Encoding at 3 bits per symbol would be longer for strings > 4 symbols. And I couldn't face base 5, which in any case would be longer for strings > 12 symbols.]

The best approach is to use a separate bit vector for each string length -- making sure that the encoding is left-justified -- so, for example, 8 character strings are encoded as 0..65535. All possible strings up to 16 symbols now require 1G bytes in 16 bit vectors.

As you say, this is more or less bearable, but each extra symbol multiplies the memory requirement by 4 :-(.


In reply to Re^2: Comparing strings (exact matches) in LARGE numbers FAST by gone2015
in thread Comparing strings (exact matches) in LARGE numbers FAST by perlSD

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.