If you can't limit yourself to 16 characters, a hash can be used.
my %lookup;
while ( <SMALLFILE> ) {
    chomp;
    $lookup{$_} = 1;
}
while ( <BIGFILE> ) {
    chomp;
    print if $lookup{$_};
}
You could pack them smaller if memory is a concern
my %acgt = (
    A => '00',
    C => '01',
    G => '10',
    T => '11',
);

# Pack each 16-character string into 4 bytes (2 bits per base).
sub encode {
    my $string = uc shift;
    (my $bin = $string) =~ s/(.)/$acgt{$1}/g;
    return pack 'B*', $bin;
}

my %lookup;
while ( <SMALLFILE> ) {
    chomp;
    $lookup{encode($_)} = 1;
}
while ( <BIGFILE> ) {
    chomp;
    print if $lookup{encode($_)};
}
As is, both my code and yours rely on all strings being the same length, but that's easily "fixed".
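One way such a fix might look (a sketch only; the length prefix and the encode_varlen name are mine, not part of the original code) is to fold the string length into the packed key, so strings of different lengths can never collide:
# Hypothetical variant of encode(): prefix the packed bases with the
# original string length so 'CG', 'CGA' and 'CGAAA' get distinct keys.
sub encode_varlen {
    my $string = uc shift;
    (my $bin = $string) =~ s/(.)/$acgt{$1}/g;
    return pack( 'C', length $string ) . pack( 'B*', $bin );
}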
In reverse order:
As is, both my code and yours rely on all strings being the same length, but that's easily "fixed".
From the OP: (all same length, pretty short) ... (again, all same length as before, ...)
If this were not the case, my bit-string encoding wouldn't work, as inputs like cc, cca, ccaaa, etc. would all be encoded the same as ccaaaaaaaaaaaaaa.
You could pack them smaller if memory is a concern
Using your compacted hash keys, you save almost exactly 20% of memory relative to a normal hash:
c:\test>hashsize -N=1e3
normalHash: 58156
compactedHash: 46156
c:\test>hashsize -N=1e4
normalHash: 605596
compactedHash: 485596
c:\test>hashsize -N=1e5
normalHash: 5924348
compactedHash: 4724348
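(The hashsize script itself isn't shown; figures like these are presumably obtained with something like Devel::Size, so the following is only a sketch of how such a measurement might be made, with the key generation being my assumption.)
use strict;
use warnings;
use Devel::Size qw( total_size );

my $N     = 1e4;                 # number of random 16-character keys
my @bases = qw( A C G T );

my( %normal, %compacted );
for ( 1 .. $N ) {
    my $key = join '', map $bases[ rand 4 ], 1 .. 16;
    $normal{ $key }            = 1;   # 16-byte string key
    $compacted{ encode($key) } = 1;   # 4-byte packed key (encode() from above)
}

printf "normalHash:    %d\n", total_size( \%normal );
printf "compactedHash: %d\n", total_size( \%compacted );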
Projecting those figures out to the (tens of millions) of strings in the OP's smaller file, your compacted hash will require about 480MB for 10 million keys, which is about the same size as my bitvector.
However, if the number of keys grows to 20e6 you need 960MB, for 30e6 1440MB, and so on, which means that on your average 32-bit machine you're going to be limited to the low tens of millions of keys, whereas my bitvector will never require more than 512MB.
Finally, compacting your keys means that lookups in your hash are going to be slower than in an ordinary hash (by almost an order of magnitude in a quick test).
By contrast, as I showed here, using vec as a lookup mechanism is over an order of magnitude faster than a hash lookup.
So: a bounded and reasonable memory requirement, and performance faster than a native hash. Got to be worth a mention, don't you think?
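(For illustration, a minimal sketch of the kind of comparison meant, not the benchmark from the node linked above; the key count and offset range are my assumptions, and the offsets are kept well below 2**31 to stay clear of the vec issue discussed further down.)
use strict;
use warnings;
use Benchmark qw( cmpthese );

my $N       = 1e6;
my @offsets = map int( rand 2**27 ), 1 .. $N;   # ~16MB bitvector, offsets < 2**31

my %hash;
my $bits = '';
for my $off ( @offsets ) {
    $hash{ $off } = 1;              # membership via a hash keyed by the offset
    vec( $bits, $off, 1 ) = 1;      # the same membership as a single bit
}

cmpthese( -3, {
    hash => sub { my $n = grep exists $hash{ $_ },  @offsets; $n },
    vec  => sub { my $n = grep vec( $bits, $_, 1 ), @offsets; $n },
} );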
That said, there are a couple of problems with the untested idea. The first is trivial to fix: I made a thinko in my encode() sub:
vec( $string, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 }
for 0 .. 15;
should have been:
## vvvvv
vec( $bits, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 }
for 0 .. 15;
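(For context, a sketch of how the corrected fragment might sit inside a complete encode() that yields a 32-bit bitvector offset; the surrounding sub, the lowercase %acgt mapping to 0..3, and the unpack step are my reconstruction, not necessarily the original code. Note that, as discussed next, offsets above 2**31 trip the vec problem described below.)
my %acgt = ( a => 0, c => 1, g => 2, t => 3 );   # assumed 2-bit value per base

sub encode {
    my $string = lc shift;
    my $bits   = '';
    # Pad short strings with 'a' and take exactly 16 bases, 2 bits each.
    vec( $bits, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 }
        for 0 .. 15;
    return unpack 'N', $bits;    # 32-bit offset into the 512MB bitvector
}

my $lookup = '';                 # one bit per possible 16-character string
while ( <SMALLFILE> ) {
    chomp;
    vec( $lookup, encode( $_ ), 1 ) = 1;
}
while ( <BIGFILE> ) {
    chomp;
    print if vec( $lookup, encode( $_ ), 1 );
}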
The second is rather less tractable, as it constitutes a bug in Perl, specifically a bug in vec. Whilst the bitvector is only 512MB, the offsets can take the full unsigned integer range 0 .. 2**32-1.
However, vec treats the offset values passed as IVs rather than UVs and so any offset >2**31 results in: Negative offset to vec in lvalue context ....
This is fairly easy to work around -- break the bitvector into two chunks to keep the offsets below 2**31 (a sketch of that workaround appears after the patch below) -- but the conditionals and math slow things down somewhat. It is still faster than a normal hash, but nonetheless annoying, as it shouldn't be necessary. I think the following changes to vec would fix the problem:
## pp.c
PP(pp_vec)
{
dVAR; dSP; dTARGET;
register const IV size = POPi;
// register const IV offset = POPi;
register const UV offset = POPi;
register SV * const src = POPs;
## doop.c
UV
//Perl_do_vecget(pTHX_ SV *sv, I32 offset, I32 size)
Perl_do_vecget(pTHX_ SV *sv, U32 offset, I32 size)
...
void
Perl_do_vecset(pTHX_ SV *sv)
{
dVAR;
// register I32 offset, bitoffs = 0;
register U32 offset;
register I32 bitoffs = 0;
register I32 size;
register unsigned char *s;
register UV lval;
I32 mask;
STRLEN targlen;
STRLEN len;
SV * const targ = LvTARG(sv);
if (!targ)
return;
s = (unsigned char*)SvPV_force(targ, targlen);
if (SvUTF8(targ)) {
/* This is handled by the SvPOK_only below...
if (!Perl_sv_utf8_downgrade(aTHX_ targ, TRUE))
SvUTF8_off(targ);
*/
(void) Perl_sv_utf8_downgrade(aTHX_ targ, TRUE);
}
(void)SvPOK_only(targ);
lval = SvUV(sv);
offset = LvTARGOFF(sv);
// if (offset < 0)
// Perl_croak(aTHX_ "Negative offset to vec in lvalue context");
I'd raise a patch, but as it takes me about 3 days to download bleedperl, by the time I have it it's already potentially out-of-date, so any patch might need to be manually applied anyway. If you agree this is a bug and are set up to produce patches against bleed, you might consider doing it?
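(As an aside, a minimal sketch of the two-chunk workaround mentioned above; the chunk layout and the sub names are my assumption, not the original code.)
# Split the 512MB bit space into two 256MB strings so that every offset
# actually passed to vec stays below 2**31.
my @chunk = ( '', '' );

sub set_bit {
    my $offset = shift;
    vec( $chunk[ $offset >> 31 ], $offset & 0x7FFF_FFFF, 1 ) = 1;
}

sub test_bit {
    my $offset = shift;
    return vec( $chunk[ $offset >> 31 ], $offset & 0x7FFF_FFFF, 1 );
}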
No doubt I will be condemned for my emphasis on performance, but that was the OP's primary concern: What approach would be really super-fast?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Just simple code and very fast lookup.
For a mere mortal like myself, the code isn't that simple :-)
Sure, the lookup is very fast, but how about the indexer function encode()?
vec( $bits, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 }
for 0 .. 15;
Is it worth the effort looping concatenations through substr()'s and performing hash lookup before vec()torizing into $bits, followed by unpack()ing it? This overhead is less noticeable, right? Just curious.
Note: I made the correction for $bits.
For a mere mortal like myself, the code isn't that simple :-)
... looping concatenations through substr()'s and performing hash lookup before vec()torizing into $bits, followed by unpack()ing it?
Your description seems spot on, so it's not that complicated ;)
This overhead is less noticeable, right?
Less noticeable than what? See Re^3: Comparing strings (exact matches) in LARGE numbers FAST for the full skinny, but in summary:
The memory requirement is fixed at 512MB (4**16 possible 16-character strings = 2**32 bits) for any number of keys, and the lookup is at least as fast as a native hash, which would exhaust the memory of a 32-bit machine at around 50e6 keys.
If the bug in vec is fixed, it will be significantly faster than a native hash.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
I missed the bit that said that all strings are the same length... so I started worrying that this wouldn't work with variable length strings, because a 2 bit code treats 'CG' as 'CGAAAAAA..A' or 'A..AAACG' (assuming 'A' is encoded as 0).
FWIW: if the requirements were to change:
My first thought was to encode 4 bits of length, followed by 2-bit symbols. In 32 bits you can then encode strings of up to 14 symbols, or in 36 bits, 16. [I rejected adding a fifth (end) symbol: encoding at 3 bits per symbol would be longer for strings of more than 4 symbols, and I couldn't face base 5, which in any case would be longer for strings of more than 12 symbols.]
The best approach is to use a separate bit vector for each string length -- making sure that the encoding is left-justified -- so, for example, 8-character strings are encoded as 0..65535. All possible strings of up to 16 symbols then require 1G bytes across the 16 bit vectors.
As you say, this is more or less bearable, but each extra symbol multiplies the memory requirement by 4 :-(.
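(A rough sketch of that per-length scheme, purely illustrative; the layout and names are mine, not the poster's code.)
my %val = ( a => 0, c => 1, g => 2, t => 3 );

# One bit vector per string length (1..16); a string of length n occupies
# one of the offsets 0 .. 4**n - 1 within its own vector.
my @by_length = ( '' ) x 17;

sub offset_of {
    my $string = lc shift;
    my $offset = 0;
    $offset = $offset * 4 + $val{ $_ } for split //, $string;
    return $offset;   # length-16 strings can exceed 2**31 and hit the vec issue above
}

sub mark { my $s = shift; vec( $by_length[ length $s ], offset_of( $s ), 1 ) = 1 }
sub seen { my $s = shift; vec( $by_length[ length $s ], offset_of( $s ), 1 ) }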