in reply to Comparing strings (exact matches) in LARGE numbers FAST

Somebody mentioned encoding the strings of characters into strings of bits, but I am not sure how to do that and how fast it would be?

If your strings are up to 16 characters in length, then bit-encoding the characters at 2 bits each will result in a 32-bit unsigned integer:

my %acgt;
@acgt{ qw[ a c g t ] } = 0 .. 3;    ## a => 0, c => 1, g => 2, t => 3

sub encode {
    my $string = lc shift;
    die "String '$string' >16 chars" if length $string > 16;
    my $bits = '';
    ## pack each character into a 2-bit field, right-padding with 'a'
    vec( $string, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 }
        for 0 .. 15;
    return unpack 'N', $bits;
}

You can then build a single bit-vector to represent the entire search space using 1 bit per possible string: 4**16 = 2**32 strings, so 512MB of RAM. Set the bits corresponding to the contents of your smaller file:

my $lookup = '';
while( <SMALLFILE> ) {
    chomp;
    vec( $lookup, encode( $_ ), 1 ) = 1;
}

and then processing the larger file becomes an O(1) lookup:

while( <BIGFILE> ) {
    chomp;
    print if vec( $lookup, encode( $_ ), 1 );
}

No sorting, searching, hashing or DBs. Just simple code and very fast lookup--if your strings are 16 characters or less.



Re^2: Comparing strings (exact matches) in LARGE numbers FAST
by ikegami (Patriarch) on Aug 29, 2008 at 05:52 UTC
    If you can't limit yourself to 16 characters, a hash can be used.
    my %lookup;
    while( <SMALLFILE> ) {
        chomp;
        $lookup{$_} = 1;
    }

    while( <BIGFILE> ) {
        chomp;
        print if $lookup{$_};
    }

    You could pack them smaller if memory is a concern

    my %acgt = (
        A => '00',
        C => '01',
        G => '10',
        T => '11',
    );

    sub encode {
        my $string = uc shift;
        (my $bin = $string) =~ s/(.)/$acgt{$1}/g;
        return pack 'B*', $bin;
    }

    my %lookup;
    while( <SMALLFILE> ) {
        chomp;
        $lookup{encode($_)} = 1;
    }

    while( <BIGFILE> ) {
        chomp;
        print if $lookup{encode($_)};
    }

    As is, both my code and yours rely on all strings being the same length, but that's easily "fixed".

      In reverse order:

      1. As is, both my code and yours rely on all strings being the same length, but that's easily "fixed".

        From the OP:

        (all same length, pretty short) ... (again, all same length as before, ...)

        If this were not the case, my bit-string encoding wouldn't work, as the inputs cc, cca, ccaaa etc. would all be encoded the same as ccaaaaaaaaaaaaaa.
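        To see the collision concretely (using the corrected encode() from further down--the value printed will be whatever your perl produces, but all three will be identical):

        ## cc, cca and ccaaa all pad out to ccaaaaaaaaaaaaaa,
        ## so all three print the same 32-bit integer:
        print encode( 'cc'    ), "\n";
        print encode( 'cca'   ), "\n";
        print encode( 'ccaaa' ), "\n";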

      2. You could pack them smaller if memory is a concern

        Using your compacted hash keys, you save almost exactly 20% of memory relative to a normal hash:

        c:\test>hashsize -N=1e3
        normalHash:    58156
        compactedHash: 46156

        c:\test>hashsize -N=1e4
        normalHash:    605596
        compactedHash: 485596

        c:\test>hashsize -N=1e5
        normalHash:    5924348
        compactedHash: 4724348
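        (hashsize isn't shown here; a minimal sketch of that kind of measurement, assuming the Devel::Size module--the real script may well differ:)

        use strict; use warnings;
        use Devel::Size qw[ total_size ];

        ## a guess at the shape of hashsize; uses your compacting
        ## encode() from above
        my $N = 1e4;

        ## N random 16-character acgt strings
        my @keys = map {
            join '', map { ( qw[ a c g t ] )[ rand 4 ] } 1 .. 16
        } 1 .. $N;

        my %normal;  $normal{ $_ }          = 1 for @keys;
        my %compact; $compact{ encode($_) } = 1 for @keys;

        printf "normalHash:    %d\n", total_size( \%normal  );
        printf "compactedHash: %d\n", total_size( \%compact );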

        Projecting that out to the OP's "tens of millions" of strings in the smaller file: at roughly 48 bytes per key, your compacted hash will require ~480MB for 10 million, which is about the same size as my bitvector.

        However, if the number of keys grows to 20e6 you need 960MB; for 30e6, 1440MB; and so on. Which means that on your average 32-bit machine you're going to be limited to the low tens of millions of keys, whereas my bitvector will never require more than 512MB.

      3. Finally, compacting your keys means that lookups in your hash are going to be slower than an ordinary hash (by almost an order of magnitude in a quick test).

        By contrast, as I showed here, using vec as a lookup mechanism is over an order of magnitude faster than a hash lookup.

      So: a bounded, reasonable memory requirement, and performance faster than a native hash. Got to be worth a mention, don't you think?


      That said, there are a couple of problems with the untested idea. The first is trivial to fix: I made a thinko in my encode() sub:

      vec( $string, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 } for 0 .. 15;

      should have been:

      ##   vvvvv
      vec( $bits, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 } for 0 .. 15;

      The second is rather less tractable, as it constitutes a bug in Perl; specifically, a bug in vec. Whilst the bitvector is only 512MB, the offsets can take the full unsigned 32-bit integer range, 0 .. 2**32-1.

      However, vec treats the offset values passed as IVs rather than UVs, and so any offset >= 2**31 results in: Negative offset to vec in lvalue context ....
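      Easily demonstrated on a 32-bit perl with something like (0x80000000 == 2**31):

      c:\test>perl -e "my $bv=''; vec( $bv, 0x80000000, 1 ) = 1"
      Negative offset to vec in lvalue context at -e line 1.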

      This is fairly easy to work around--break the bitvector into two chunks to keep the offsets < 2**31--but the conditionals and math slow things down somewhat. It is still faster than a normal hash, but nonetheless annoying as it shouldn't be necessary.
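      A minimal sketch of the two-chunk workaround (my names; untested at scale): use the top bit of the encoded value to pick one of two 256MB vectors, which keeps every offset passed to vec below 2**31:

      my @lookup = ( '', '' );    ## 2 x 256MB instead of 1 x 512MB

      sub setbit {
          my $n = shift;    ## 0 .. 2**32-1 from encode()
          vec( $lookup[ $n >> 31 ], $n & 0x7fffffff, 1 ) = 1;
      }

      sub getbit {
          my $n = shift;
          return vec( $lookup[ $n >> 31 ], $n & 0x7fffffff, 1 );
      }

      I think the following changes to vec would fix the problem: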

      ## sv.c
      PP(pp_vec)
      {
          dVAR; dSP; dTARGET;
          register const IV size = POPi;
      //  register const IV offset = POPi;
          register const UV offset = POPi;
          register SV * const src = POPs;

      ## doop.c
      UV
      //Perl_do_vecget(pTHX_ SV *sv, I32 offset, I32 size)
      Perl_do_vecget(pTHX_ SV *sv, U32 offset, I32 size)
      ...

      void
      Perl_do_vecset(pTHX_ SV *sv)
      {
          dVAR;
      //  register I32 offset, bitoffs = 0;
          register U32 offset;
          register I32 bitoffs = 0;
          register I32 size;
          register unsigned char *s;
          register UV lval;
          I32 mask;
          STRLEN targlen;
          STRLEN len;
          SV * const targ = LvTARG(sv);

          if (!targ)
              return;
          s = (unsigned char*)SvPV_force(targ, targlen);
          if (SvUTF8(targ)) {
              /* This is handled by the SvPOK_only below...
              if (!Perl_sv_utf8_downgrade(aTHX_ targ, TRUE))
                  SvUTF8_off(targ);
              */
              (void) Perl_sv_utf8_downgrade(aTHX_ targ, TRUE);
          }
          (void)SvPOK_only(targ);
          lval = SvUV(sv);
          offset = LvTARGOFF(sv);
      //  if (offset < 0)
      //      Perl_croak(aTHX_ "Negative offset to vec in lvalue context");

      I'd raise a patch, but as it takes me about 3 days to download bleedperl, by the time I have it, it's already potentially out-of-date, so any patch might need to be manually applied anyway. If you agree this is a bug and are set up to produce patches against bleed, you might consider doing it?

      No doubt I will be condemned for my emphasis on performance, but that was the OP's primary concern: "What approach would be really super-fast?"


Re^2: Comparing strings (exact matches) in LARGE numbers FAST
by repellent (Priest) on Aug 29, 2008 at 06:17 UTC
      Just simple code and very fast lookup.

    For a mere mortal like myself, the code isn't that simple :-)

    Sure, the lookup is very fast, but how about the indexer function encode()?
    vec( $bits, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 } for 0 .. 15;

    Is it worth the effort looping concatenations through substr()'s and performing hash lookup before vec()torizing into $bits, followed by unpack()ing it? This overhead is less noticeable, right? Just curious.

    Note: I made the correction for $bits.
      For a mere mortal like myself, the code isn't that simple :-)

      ... looping concatenations through substr()'s and performing hash lookup before vec()torizing into $bits, followed by unpack()ing it?

      Your description seems spot on, so it's not that complicated ;)

      This overhead is less noticeable, right?

      Less noticeable than what? See Re^3: Comparing strings (exact matches) in LARGE numbers FAST for the full skinny, but in summary:

      The memory requirement is fixed (512MB) for any number of keys, and the lookup is at least as fast as a native hash--which, at roughly 48 bytes per key, would break the memory of a 32-bit machine at around 50e6 keys.

      If the bug in vec is fixed, it will be significantly faster than a native hash.
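      If you want to weigh the encode() overhead yourself, a rough sketch of the comparison with the core Benchmark module (assumes the corrected encode() from above is in scope):

      use strict; use warnings;
      use Benchmark qw[ cmpthese ];

      ## NB: on a 32-bit perl, keys encoding to >= 2**31 will trip
      ## the vec offset bug discussed in my other reply
      my @strings = map {
          join '', map { ( qw[ a c g t ] )[ rand 4 ] } 1 .. 16
      } 1 .. 1000;

      my %hash;
      $hash{ $_ } = 1 for @strings;

      my $vector = '';
      vec( $vector, encode( $_ ), 1 ) = 1 for @strings;

      cmpthese( -3, {
          hash => sub { my $n = 0; $n += $hash{ $_ } for @strings },
          vec  => sub { my $n = 0; $n += vec( $vector, encode( $_ ), 1 ) for @strings },
      } );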


Re^2: Comparing strings (exact matches) in LARGE numbers FAST
by gone2015 (Deacon) on Aug 29, 2008 at 18:32 UTC

    I missed the bit that said that all strings are the same length... so I started worrying that this wouldn't work with variable-length strings, because a 2-bit code treats 'CG' as 'CGAAAAAA..A' or 'A..AAACG' (assuming 'A' is encoded as 0).

    FWIW: if the requirements were to change:

    My first thought was to encode 4 bits of length, followed by 2-bit symbols. In 32 bits you can then encode strings of up to 14 symbols; in 36 bits, 16. [I rejected adding a fifth (end) symbol: encoding at 3 bits per symbol would be longer for strings of more than 4 symbols, and I couldn't face base 5, which in any case would be longer for strings of more than 12 symbols.]

    The best approach is to use a separate bit vector for each string length--making sure that the encoding is left-justified--so that, for example, 8-character strings are encoded as 0..65535. All possible strings of up to 16 symbols then require 1GB across sixteen bit vectors.
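    A sketch of how that might look, reusing the 2-bit encoding from above (my names; untested--and on a 32-bit perl, lengths 15 and 16 run into the same vec offset limitation discussed elsewhere in this thread):

    my %code;
    @code{ qw[ a c g t ] } = 0 .. 3;

    my @lookup = ( '' ) x 17;    ## one bit vector per length, 1 .. 16

    ## map a length-L string onto the dense range 0 .. 4**L - 1
    sub encode_n {
        my $n = 0;
        $n = $n * 4 + $code{ $_ } for split //, lc shift;
        return $n;
    }

    sub mark { my $s = shift; vec( $lookup[ length $s ], encode_n( $s ), 1 ) = 1 }
    sub seen { my $s = shift; vec( $lookup[ length $s ], encode_n( $s ), 1 ) }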

    As you say, this is more or less bearable, but each extra symbol multiplies the memory requirement by 4 :-(.