Somebody mentioned encoding the strings of characters into strings of bits, but I am not sure how to do that and how fast it would be?
If your strings are up to 16 characters in length, then encoding each character in 2 bits yields a 32-bit unsigned integer:
    my %acgt; @acgt{ qw[ a c g t ] } = 0 .. 3;

    sub encode {
        my $string = lc shift;
        die "String '$string' >16 chars" if length( $string ) > 16;
        my $bits = '';
        # Write 2 bits per character into $bits (not $string), padding short strings with 'a'
        vec( $bits, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 } for 0 .. 15;
        return unpack 'N', $bits;
    }
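To make the encoding concrete, here is a small self-contained sketch that prints the integer produced for a few inputs. It assumes the 2-bit elements are written into $bits (the packed result) rather than into $string; the sample inputs are made up for illustration.

```perl
use strict;
use warnings;

my %acgt;
@acgt{ qw[ a c g t ] } = 0 .. 3;

sub encode {
    my $string = lc shift;
    die "String '$string' >16 chars" if length( $string ) > 16;
    my $bits = '';
    # 16 elements x 2 bits = 4 bytes; short strings are padded with 'a' (value 0)
    vec( $bits, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 } for 0 .. 15;
    return unpack 'N', $bits;    # read the 4 bytes as a big-endian 32-bit integer
}

# Illustrative inputs only
printf "%-16s => 0x%08X\n", $_, encode( $_ ) for qw( a acgt tttttttttttttttt );
# a                => 0x00000000
# acgt             => 0xE4000000
# tttttttttttttttt => 0xFFFFFFFF
```

One thing to watch: because short strings are padded with 'a', two strings that differ only in trailing 'a's (for example 'ac' and 'aca') get the same integer. That does not matter if all your strings are the same length; if they are not, you would need to account for it.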
You can then build a single bit-vector to represent the entire search space, using one bit per possible string: 4**16 = 2**32 bits, or 512MB of RAM. Set the bits corresponding to the contents of your smaller file:
    my $lookup = '';
    while( <SMALLFILE> ) {
        chomp;
        vec( $lookup, encode( $_ ), 1 ) = 1;
    }
and then processing the larger file becomes an O(1) lookup per line:
    while( <BIGFILE> ) {
        chomp;
        print if vec( $lookup, encode( $_ ), 1 );
    }
No sorting, searching, hashing or DBs. Just simple code and very fast lookup, provided your strings are 16 characters or fewer.
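Putting the pieces together, the following is a minimal end-to-end sketch rather than a finished program. It assumes encode() writes into $bits, and it uses in-memory filehandles with made-up sample data in place of SMALLFILE and BIGFILE. The sample strings deliberately start with runs of 'a' so that their indexes are small and the demo vector stays tiny; with real data the vector grows towards the full 512MB.

```perl
use strict;
use warnings;

my %acgt;
@acgt{ qw[ a c g t ] } = 0 .. 3;

sub encode {
    my $string = lc shift;
    die "String '$string' >16 chars" if length( $string ) > 16;
    my $bits = '';
    vec( $bits, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 } for 0 .. 15;
    return unpack 'N', $bits;
}

# Hypothetical sample data standing in for the two files
my $small = "aaaaaaaaaaaaacgt\naaaaaaaaaaaatttt\n";
my $big   = "aaaaaaaaaaaaccca\naaaaaaaaaaaaacgt\naaaaaaaaaaaagggg\naaaaaaaaaaaatttt\n";

open my $small_fh, '<', \$small or die "Cannot open in-memory small file: $!";
open my $big_fh,   '<', \$big   or die "Cannot open in-memory big file: $!";

# Set one bit per string seen in the smaller file
my $lookup = '';
while ( my $line = <$small_fh> ) {
    chomp $line;
    vec( $lookup, encode( $line ), 1 ) = 1;
}

# Test each string of the larger file against the bit-vector
my @matches;
while ( my $line = <$big_fh> ) {
    chomp $line;
    push @matches, $line if vec( $lookup, encode( $line ), 1 );
}

print "$_\n" for @matches;
# aaaaaaaaaaaaacgt
# aaaaaaaaaaaatttt
```

Collecting matches into @matches instead of printing straight away is just a convenience for the demo; printing inside the loop, as in the original, avoids holding them in memory.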
In reply to Re: Comparing strings (exact matches) in LARGE numbers FAST
by BrowserUk
in thread Comparing strings (exact matches) in LARGE numbers FAST
by perlSD