in reply to Re^2: Comparing strings (exact matches) in LARGE numbers FAST
in thread Comparing strings (exact matches) in LARGE numbers FAST

In reverse order:

  1. As is, both my code and yours rely on all strings being the same length, but that's easily "fixed".

    From the OP:

    (all same length, pretty short) ... (again, all same length as before, ...)

    If this were not the case, my bit-string encoding wouldn't work as inputs cc and cca and ccaaa ect. would all be encoded the same as ccaaaaaaaaaaaaaa.

  2. You could pack them smaller if memory is a concern

    Using your compacted hash keys, you save almost exactly 20% of memory relative to a normal hash:

    c:\test>hashsize -N=1e3 normalHash: 58156 compactedHash: 46156 c:\test>hashsize -N=1e4 normalHash: 605596 compactedHash: 485596 c:\test>hashsize -N=1e5 normalHash: 5924348 compactedHash: 4724348

    Which projecting out to the OPs (tens of millions) of strings in the smaller file, means your compacted hash will require 480MB for 10million which is about the same size as my bitvector.

    However, if the number of keys grows to 20e6, you need 960MB and for 30e6 1440MB and so on, which mean that on your average 32-bit machine, you're going to be limited to low 10s of millions, whereas my bitvector will never require more than 512MB.

  3. Finally, compacting your keys means that lookups in your hash are going to be slower than a ordinary hash (by almost an order of magnitude in a quick test).

    By contrast, as I showed here using vec as a lookup mechanism is over an order of magnitude faster than a hash lookup.

So, a bounded and reasonable memory requirement and performance faster than a native hash. Got to be worth a mention don't you think?


That said, there are a couple of problems with the untested idea. The first is trivial to fix, I made a thinko in my encode() sub:

vec( $string, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 } for 0 .. 15;

should have been:

## vvvvv vec( $bits, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 } for 0 .. 15;

The second is rather less tractable as it constitutes a bug in Perl, specifically a bug in vec. Whilst the bitvector is only 512MB, the offsets can take the full unsigned integer range 0 .. 2**32.

However, vec treats the offset values passed as IVs rather than UVs and so any offset >2**31 results in: Negative offset to vec in lvalue context ....

This is fairly easy to work around--break the bitvector into two chunks to keep the offsets < 2**31--but the conditionals and math slows thing down somewhat. It is still faster than a normal hash, but none the less annoying as it shouldn't be necessary. I think the following changes to vec would fix the problem:

## sv.c PP(pp_vec) { dVAR; dSP; dTARGET; register const IV size = POPi; // register const IV offset = POPi; register const UV offset = POPi; register SV * const src = POPs; ## doop.c UV //Perl_do_vecget(pTHX_ SV *sv, I32 offset, I32 size) Perl_do_vecget(pTHX_ SV *sv, U32 offset, I32 size) ... void Perl_do_vecset(pTHX_ SV *sv) { dVAR; // register I32 offset, bitoffs = 0; register U32 offset register I32 bitoffs = 0; register I32 size; register unsigned char *s; register UV lval; I32 mask; STRLEN targlen; STRLEN len; SV * const targ = LvTARG(sv); if (!targ) return; s = (unsigned char*)SvPV_force(targ, targlen); if (SvUTF8(targ)) { /* This is handled by the SvPOK_only below... if (!Perl_sv_utf8_downgrade(aTHX_ targ, TRUE)) SvUTF8_off(targ); */ (void) Perl_sv_utf8_downgrade(aTHX_ targ, TRUE); } (void)SvPOK_only(targ); lval = SvUV(sv); offset = LvTARGOFF(sv); // if (offset < 0) // Perl_croak(aTHX_ "Negative offset to vec in lvalue context");

I'd raise a patch but as it takes me about 3 days to download bleedperl, by the time I have it its already potentially out-of-date, so any patch might need to be manually applied anyway. If you agree this is a bug and are set up to produce patches against bleed, you might consider doing it?

No doubt I will be condemned for my emphasis on performance, but that was the OPs primary concern: What approach would be really super-fast?.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."