in reply to Re: Comparing strings (exact matches) in LARGE numbers FAST
in thread Comparing strings (exact matches) in LARGE numbers FAST

I missed the bit that said that all strings are the same length... so I started worrying that this wouldn't work with variable length strings, because a 2 bit code treats 'CG' as 'CGAAAAAA..A' or 'A..AAACG' (assuming 'A' is encoded as 0).

FWIW: if the requirements were to change:

My first thought was to encode 4 bits of length, followed by 2 bit symbols. In 32 bits you can then encode 14 character symbols, or in 36 bits, 16. [I rejected adding a fifth symbol (end). Encoding at 3 bits per symbol would be longer for strings > 4 symbols. And I couldn't face base 5, which in any case would be longer for strings > 12 symbols.]

The best approach is to use a separate bit vector for each string length -- making sure that the encoding is left-justified -- so, for example, 8 character strings are encoded as 0..65535. All possible strings up to 16 symbols now require 1G bytes in 16 bit vectors.

As you say, this is more or less bearable, but each extra symbol multiplies the memory requirement by 4 :-(.

  • Comment on Re^2: Comparing strings (exact matches) in LARGE numbers FAST