perlSD has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have to compare a large number (tens of millions) of strings (all same length, pretty short) in a file (one string per line), to an even larger number (hundreds of millions) of strings (again, all same length as before, one string per line).
In each file, each string is unique, i.e. encountered only once in that file.
Note: the strings are made of only letters. More exactly, 4 letters - nucleotide bases (A,C,G,T).

I am looking ONLY for EXACT matches.
What approach would be really super-fast?

Many thanks in advance!!!

Wow! So many interesting and elegant solutions! I will try a few!
For those who were considering the length of the strings: it is currently 32, but the code really has to work with any length between 25 and 100.


Re: Comparing strings (exact matches) in LARGE numbers FAST
by tilly (Archbishop) on Aug 28, 2008 at 23:35 UTC
    Frankly, I'd tackle this by sorting each file with the Unix sort utility and then scanning the two sorted files in parallel, merge-style, to filter out the matches.

    The scanning logic would look like this (untested):

    #! /usr/bin/perl -w
    use strict;

    unless (2 == @ARGV) {
        die "Usage: $0 file1 file2";
    }
    open(FILE1, "<", $ARGV[0]) or die "Can't read '$ARGV[0]': $!";
    open(FILE2, "<", $ARGV[1]) or die "Can't read '$ARGV[1]': $!";

    my $string1 = <FILE1>;
    my $string2 = <FILE2>;
    while (defined $string1 and defined $string2) {
        if ($string1 lt $string2) {
            $string1 = <FILE1> or last;
        }
        elsif ($string1 gt $string2) {
            $string2 = <FILE2> or last;
        }
        else {
            print $string1;
            $string1 = <FILE1> or last;
            $string2 = <FILE2> or last;
        }
    }
      After sorting each file, you can just use the unix "join" command.
        Ah. Good point. However, when I use this technique I'm usually doing a little more logic, so I can't use that. (And I'm usually dealing with a dataset that didn't fit in the database I was using, so I can't use that approach either.)
      Thanks! That's pretty neat! I'll try that. Although the sorting may take a few minutes, at least on my Windows machine (I have Unix-like tools I can use from the command line)....
Re: Comparing strings (exact matches) in LARGE numbers FAST
by BrowserUk (Patriarch) on Aug 29, 2008 at 04:09 UTC
    Somebody mentioned encoding the strings of characters into strings of bits, but I am not sure how to do that and how fast it would be?

    If your strings are up to 16 characters in length, then bit-encoding the characters will result in a 32-bit unsigned integer:

    my %acgt;
    @acgt{ qw[ a c g t ] } = 0 .. 3;

    sub encode {
        my $string = lc shift;
        die "String '$string' >16 chars" if length $string > 16;
        my $bits = '';
        vec( $string, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 } for 0 .. 15;
        return unpack 'N', $bits;
    }

    You can then build a single bit-vector in RAM representing the entire search space, using one bit per possible string (512MB in total). Set the bits corresponding to the contents of your smaller file:

    my $lookup = '';
    while( <SMALLFILE> ) {
        chomp;
        vec( $lookup, encode( $_ ), 1 ) = 1;
    }

    and then processing the larger file becomes an O(1) lookup per string:

    while( <BIGFILE> ) {
        chomp;
        print if vec( $lookup, encode( $_ ), 1 );
    }

    No sorting, searching, hashing or DBs. Just simple code and very fast lookup. If your strings are 16 characters or less.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      If you can't limit yourself to 16 characters, a hash can be used.
      my %lookup;
      while( <SMALLFILE> ) {
          chomp;
          $lookup{$_} = 1;
      }
      while( <BIGFILE> ) {
          chomp;
          print if $lookup{$_};
      }

      You could pack them smaller if memory is a concern

      my %acgt = (
          A => '00',
          C => '01',
          G => '10',
          T => '11',
      );

      sub encode {
          my $string = uc shift;
          (my $bin = $string) =~ s/(.)/$acgt{$1}/g;
          return pack 'B*', $bin;
      }

      my %lookup;
      while( <SMALLFILE> ) {
          chomp;
          $lookup{encode($_)} = 1;
      }
      while( <BIGFILE> ) {
          chomp;
          print if $lookup{encode($_)};
      }

      As is, both my code and yours rely on all strings being the same length, but that's easily "fixed".
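      A minimal sketch of what that "fix" might look like for the compacted-key variant (this is just an illustration, not code from the thread): include the string's length in the packed key, so that e.g. 'CG' and 'CGAA' can no longer collide once the bit string is padded out to a whole byte. The plain-hash version above needs no change at all, since it keys on the raw strings.

      my %acgt = ( A => '00', C => '01', G => '10', T => '11' );

      # hypothetical encode() variant: prefix the packed bases with one
      # length byte so padded encodings of different lengths stay distinct
      sub encode_keyed {
          my $string = uc shift;
          (my $bin = $string) =~ s/(.)/$acgt{$1}/g;
          return pack( 'C', length $string ) . pack( 'B*', $bin );
      }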

        In reverse order:

        1. As is, both my code and yours rely on all strings being the same length, but that's easily "fixed".

          From the OP:

          (all same length, pretty short) ... (again, all same length as before, ...)

          If this were not the case, my bit-string encoding wouldn't work, as inputs cc and cca and ccaaa etc. would all be encoded the same as ccaaaaaaaaaaaaaa.

        2. You could pack them smaller if memory is a concern

          Using your compacted hash keys, you save almost exactly 20% of memory relative to a normal hash:

          c:\test>hashsize -N=1e3
          normalHash:      58156
          compactedHash:   46156

          c:\test>hashsize -N=1e4
          normalHash:     605596
          compactedHash:  485596

          c:\test>hashsize -N=1e5
          normalHash:    5924348
          compactedHash: 4724348

          Projecting that out to the OP's (tens of millions) of strings in the smaller file, your compacted hash will require about 480MB for 10 million keys, which is roughly the same size as my bitvector.

          However, if the number of keys grows to 20e6 you need 960MB, for 30e6 you need 1440MB, and so on, which means that on your average 32-bit machine you're going to be limited to the low tens of millions, whereas my bitvector will never require more than 512MB.

        3. Finally, compacting your keys means that lookups in your hash are going to be slower than an ordinary hash (by almost an order of magnitude in a quick test).

          By contrast, as I showed here, using vec as a lookup mechanism is over an order of magnitude faster than a hash lookup (see the sketch below).

        So: a bounded and reasonable memory requirement, and performance faster than a native hash. Got to be worth a mention, don't you think?
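        For anyone who wants to reproduce the comparison, here is a rough, untested sketch of how the size and speed figures might be measured with Devel::Size and Benchmark. The hashsize script above isn't shown in this thread, so this is my own stand-in; the key count and the random 16-base strings are made up for illustration.

        #!/usr/bin/perl
        # Rough stand-in for the hashsize/benchmark scripts referred to above.
        # Compares memory (Devel::Size) and lookup speed (Benchmark) of a plain
        # hash, a compacted-key hash and a vec() bit-vector, for N random
        # 16-base strings.
        use strict;
        use warnings;
        use Devel::Size qw( total_size );
        use Benchmark   qw( cmpthese );

        my $N = shift || 100_000;
        my @keys = map { join '', map { ( 'A', 'C', 'G', 'T' )[ rand 4 ] } 1 .. 16 } 1 .. $N;

        my %acgt = ( A => '00', C => '01', G => '10', T => '11' );
        sub compact { ( my $bin = shift ) =~ s/(.)/$acgt{$1}/g; pack 'B*', $bin }

        my %code; @code{ 'A', 'C', 'G', 'T' } = 0 .. 3;
        sub encode {    # 16 bases -> one 32-bit offset, 2 bits per base
            my $string = shift;
            my $bits   = '';
            vec( $bits, $_, 2 ) = $code{ substr $string, $_, 1 } for 0 .. 15;
            return unpack 'N', $bits;
        }

        my ( %normal, %compacted );
        my $bitvec = '';    # grows to ~512MB, as described above
        for my $key ( @keys ) {
            $normal{$key}               = 1;
            $compacted{ compact($key) } = 1;
            # NB: on older 32-bit perls, offsets above 2**31 trip the
            # lvalue-vec limitation discussed further down this thread
            vec( $bitvec, encode( $key ), 1 ) = 1;
        }

        printf "normal hash:    %10d bytes\n", total_size( \%normal );
        printf "compacted hash: %10d bytes\n", total_size( \%compacted );
        printf "bit vector:     %10d bytes\n", length $bitvec;

        cmpthese( -3, {
            normal    => sub { my $hit = $normal{    $keys[ rand @keys ] } },
            compacted => sub { my $hit = $compacted{ compact( $keys[ rand @keys ] ) } },
            vec       => sub { my $hit = vec( $bitvec, encode( $keys[ rand @keys ] ), 1 ) },
        } );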


        That said, there are a couple of problems with the untested idea. The first is trivial to fix: I made a thinko in my encode() sub:

        vec( $string, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 } for 0 .. 15;

        should have been:

        ##   vvvvv
        vec( $bits, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 } for 0 .. 15;

        The second is rather less tractable as it constitutes a bug in Perl, specifically a bug in vec. Whilst the bitvector is only 512MB, the offsets can take the full unsigned integer range 0 .. 2**32-1.

        However, vec treats the offset values passed as IVs rather than UVs and so any offset >2**31 results in: Negative offset to vec in lvalue context ....

        This is fairly easy to work around--break the bitvector into two chunks to keep the offsets < 2**31 (a sketch of that workaround appears after the patch below)--but the extra conditional and math slow things down somewhat. It is still faster than a normal hash, but nonetheless annoying as it shouldn't be necessary. I think the following changes to vec would fix the problem:

        ## sv.c
        PP(pp_vec)
        {
            dVAR; dSP; dTARGET;
            register const IV size   = POPi;
        //  register const IV offset = POPi;
            register const UV offset = POPi;
            register SV * const src = POPs;

        ## doop.c
        UV
        //Perl_do_vecget(pTHX_ SV *sv, I32 offset, I32 size)
        Perl_do_vecget(pTHX_ SV *sv, U32 offset, I32 size)
        ...

        void
        Perl_do_vecset(pTHX_ SV *sv)
        {
            dVAR;
        //  register I32 offset, bitoffs = 0;
            register U32 offset;
            register I32 bitoffs = 0;
            register I32 size;
            register unsigned char *s;
            register UV lval;
            I32 mask;
            STRLEN targlen;
            STRLEN len;
            SV * const targ = LvTARG(sv);

            if (!targ)
                return;
            s = (unsigned char*)SvPV_force(targ, targlen);
            if (SvUTF8(targ)) {
                /* This is handled by the SvPOK_only below...
                if (!Perl_sv_utf8_downgrade(aTHX_ targ, TRUE))
                    SvUTF8_off(targ);
                */
                (void) Perl_sv_utf8_downgrade(aTHX_ targ, TRUE);
            }
            (void)SvPOK_only(targ);
            lval = SvUV(sv);
            offset = LvTARGOFF(sv);
        //  if (offset < 0)
        //      Perl_croak(aTHX_ "Negative offset to vec in lvalue context");

        I'd raise a patch but as it takes me about 3 days to download bleedperl, by the time I have it it's already potentially out-of-date, so any patch might need to be manually applied anyway. If you agree this is a bug and are set up to produce patches against bleed, you might consider doing it?
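        For completeness, here is what the two-chunk workaround mentioned above might look like (just a sketch of the idea, not the code that was actually benchmarked; SMALLFILE, BIGFILE and encode() are as in the earlier snippets):

        # Split the 512MB bit-vector into two 256MB halves so that every vec()
        # offset stays below 2**31, avoiding the "Negative offset to vec in
        # lvalue context" croak described above.
        my @lookup = ( '', '' );

        sub set_bit {
            my $n = shift;                                  # 0 .. 2**32-1 from encode()
            vec( $lookup[ $n >> 31 ], $n & 0x7FFF_FFFF, 1 ) = 1;
        }

        sub get_bit {
            my $n = shift;
            return vec( $lookup[ $n >> 31 ], $n & 0x7FFF_FFFF, 1 );
        }

        while( <SMALLFILE> ) { chomp; set_bit( encode( $_ ) ); }
        while( <BIGFILE>   ) { chomp; print if get_bit( encode( $_ ) ); }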

        No doubt I will be condemned for my emphasis on performance, but that was the OP's primary concern: "What approach would be really super-fast?"


        Just simple code and very fast lookup.

      For a mere mortal like myself, the code isn't that simple :-)

      Sure, the lookup is very fast, but how about the indexer function encode()?
      vec( $bits, $_, 2 ) = $acgt{ substr $string . 'a' x 16, $_, 1 } for 0 .. 15;

      Is it worth the effort looping concatenations through substr()'s and performing hash lookup before vec()torizing into $bits, followed by unpack()ing it? This overhead is less noticeable, right? Just curious.

      Note: I made the correction for $bits.
        For a mere mortal like myself, the code isn't that simple :-)

        ... looping concatenations through substr()'s and performing hash lookup before vec()torizing into $bits, followed by unpack()ing it?

        Your description seems spot on, so it's not that complicated ;)

        This overhead is less noticeable, right?

        Less noticeable than what? See Re^3: Comparing strings (exact matches) in LARGE numbers FAST for the full skinny, but in summary:

        The memory requirement is fixed (512MB) for any number of keys, and the lookup is at least as fast as a native hash, which would exhaust the memory of a 32-bit machine at around 50e6 keys.

        If the bug in vec is fixed, it will be significantly faster than a native hash.



      I missed the bit that said that all strings are the same length... so I started worrying that this wouldn't work with variable length strings, because a 2 bit code treats 'CG' as 'CGAAAAAA..A' or 'A..AAACG' (assuming 'A' is encoded as 0).

      FWIW: if the requirements were to change:

      My first thought was to encode 4 bits of length, followed by 2-bit symbols. In 32 bits you can then encode 14 symbols, or in 36 bits, 16. [I rejected adding a fifth symbol (end). Encoding at 3 bits per symbol would be longer for strings > 4 symbols. And I couldn't face base 5, which in any case would be longer for strings > 12 symbols.]

      The best approach is to use a separate bit vector for each string length -- making sure that the encoding is left-justified -- so, for example, 8-character strings are encoded as 0..65535. All possible strings up to 16 symbols now require about 1G bytes spread across 16 separate bit vectors.

      As you say, this is more or less bearable, but each extra symbol multiplies the memory requirement by 4 :-(.
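      A rough sketch of the per-length bit-vector idea (my own illustration; it assumes the same 2-bits-per-base mapping as above and handles lengths up to 16):

      my %code; @code{ 'A', 'C', 'G', 'T' } = 0 .. 3;
      my @lookup = ( '' ) x 17;          # one bit-vector per string length 0..16

      sub encode_n {                     # length-L string -> 0 .. 4**L - 1
          my $string = uc shift;
          my $n = 0;
          $n = ( $n << 2 ) | $code{$_} for split //, $string;
          return $n;
      }

      # Together these vectors need several hundred MB when lengths up to 16
      # occur.  (Length-16 strings again produce offsets above 2**31; see the
      # vec discussion above.)
      while( <SMALLFILE> ) {
          chomp;
          vec( $lookup[ length $_ ], encode_n( $_ ), 1 ) = 1;
      }
      while( <BIGFILE> ) {
          chomp;
          print if vec( $lookup[ length $_ ], encode_n( $_ ), 1 );
      }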

Re: Comparing strings (exact matches) in LARGE numbers FAST
by snoopy (Curate) on Aug 29, 2008 at 00:06 UTC
    >> Note: the strings are made of only letters. More exactly, 4 letters - nucleotide bases (A,C,G,T).

    Is this related to genome matching, sequencing, etc.?

    If you haven't already, check out the wide range of resources available from bioperl.

    There's lots of interesting reading, algorithms and freely available Perl code.

      Thanks, I'll check the BioPerl resources. Yes, the first file is ultra-high-throughput sequencing output. No, it's not genome matching - there is software that does that. The second file would be a collection of bio-markers, if you will.
Re: Comparing strings (exact matches) in LARGE numbers FAST
by moritz (Cardinal) on Aug 28, 2008 at 22:45 UTC
Re: Comparing strings (exact matches) in LARGE numbers FAST
by repellent (Priest) on Aug 29, 2008 at 01:59 UTC
    Given that your lines are already unique within each file, this is the shell way to find the intersection of the 2 files:
    $ sort file1.txt file2.txt | uniq -d > matches.txt

    This approach would work on very large files efficiently.
      That is very true; Unix-like utilities are designed to work smartly with very large files... Will try.
      Thanks!!!
Re: Comparing strings (exact matches) in LARGE numbers FAST
by dhosek (Beadle) on Aug 28, 2008 at 23:48 UTC
    If the strings are short enough, you could pack them into integer values, but that means that you're limited to 16 letters (for 32-bit integers) or 32 letters (for 64-bit integers). Getting there is an interesting challenge. We can map {'A', 'C', 'T', 'G'} to {0,1,2,3} by taking (ord(x) & 0x6) >> 1 (the parentheses matter, since in Perl >> binds more tightly than &). You'll probably want to look at the perldoc for the pack and unpack functions for some more specific ideas on how to get there.
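    For example, a minimal sketch of that packing (my illustration only; it assumes a perl built with 64-bit integers so that up to 32 bases fit in one value):

    # Map A, C, T, G to 0, 1, 2, 3 with (ord & 0x6) >> 1 and fold the string
    # into a single unsigned integer, two bits per base.
    sub pack_bases {
        my $string = uc shift;
        die "too long for one integer" if length( $string ) > 32;
        my $n = 0;
        $n = ( $n << 2 ) | ( ( ord( $_ ) & 0x6 ) >> 1 ) for split //, $string;
        return $n;
    }

    # pack_bases( 'ACTG' ) == 0b00_01_10_11 == 27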
Re: Comparing strings (exact matches) in LARGE numbers FAST
by FlatBallFlyer (Acolyte) on Aug 28, 2008 at 23:04 UTC
    It sounds to me like you would be better off writing code to slurp the files into a database with indexes. That would make finding exact matches run fast (most databases use red-black binary trees or B-trees for indexes). The data import would probably be the biggest time-hog; I would suggest finding a way to import the data directly from the text files into the database, and then using DBI to manipulate the data. That's just my 2 cents.
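    In case it helps, here is a rough sketch of how that might look with DBI (SQLite is assumed as the backend purely for illustration; the file and table names are made up):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( "dbi:SQLite:dbname=seqs.db", "", "",
                            { RaiseError => 1, AutoCommit => 1 } );

    $dbh->do( "CREATE TABLE IF NOT EXISTS big   ( seq TEXT PRIMARY KEY )" );
    $dbh->do( "CREATE TABLE IF NOT EXISTS small ( seq TEXT PRIMARY KEY )" );

    # bulk-load each file inside a single transaction
    for my $load ( [ big => 'bigfile.txt' ], [ small => 'smallfile.txt' ] ) {
        my ( $table, $file ) = @$load;
        open my $fh, '<', $file or die "Can't read '$file': $!";
        my $sth = $dbh->prepare( "INSERT INTO $table ( seq ) VALUES ( ? )" );
        $dbh->begin_work;
        while ( <$fh> ) {
            chomp;
            $sth->execute( $_ );
        }
        $dbh->commit;
    }

    # exact matches are then a simple join on the indexed column
    my $matches = $dbh->selectcol_arrayref(
        "SELECT big.seq FROM big JOIN small ON big.seq = small.seq" );
    print "$_\n" for @$matches;

    For the bulk load itself, the database's native import facility (e.g. sqlite3's .import) would likely beat row-by-row inserts, which matches the suggestion to import the data directly from the text files.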
Re: Comparing strings (exact matches) in LARGE numbers FAST
by BrowserUk (Patriarch) on Aug 29, 2008 at 01:08 UTC
Re: Comparing strings (exact matches) in LARGE numbers FAST
by JavaFan (Canon) on Aug 29, 2008 at 11:52 UTC
    I wouldn't use Perl; there are tools available to do this, and any POSIX compliant system will have them.
    $ sort file1 > file1.sorted
    $ sort file2 > file2.sorted
    $ comm -12 file1.sorted file2.sorted

    In bash (in several other shells probably as well, but I don't know them well enough), you can even do it as a one-liner, without the use of temporary files (*):

    $ comm -12 <(sort file1) <(sort file2)

    (*) Well, sort may use temporary files if the file to be sorted cannot be held in memory, but it'll clean up afterwards.

      Thanks, I completely forgot about the comm command... Since I am going to use the same 2nd (larger) file run after run, I will sort the file once and save it sorted, so I won't have to do that over and over.
      Sorting such a large file may take close to an hour, I would think... On my machine, sorting a file with a few million strings like that takes at least a few minutes.
Re: Comparing strings (exact matches) in LARGE numbers FAST
by bioinformatics (Friar) on Aug 29, 2008 at 06:29 UTC
    Are there certain categories of sequence data? Are we talking ChIP-seq-style short sequences, or longer-range sequences from PCR products or plasmids (not that it terribly matters, I suppose)? Any way to break the problem into pieces helps. Essentially, it sounds like you want to run a large-scale BLAST. BioPerl doesn't have anything that will scale THAT well, and the current modules use temp files, so you will have a lot of hard-drive access, further slowing down the process.

    You could try taking the sequence and translating it into the alphabetical code for amino acids. This helps in a couple of ways. It shortens the amount of data that has to be run through by any pattern-matching algorithm, and it also decreases the repetition of the sequence. The built-in Perl matching (correct me if I'm wrong, folks) matches one letter at a time, so if the first three letters are ACG, it goes until it finds an A, then looks for a C, etc. There are a lot of repetitive sequences in DNA, and over time this builds up and takes more time to go through. With 20 or so amino acids and less repetition, it would remove some of the wasted time on the rabbit trails, so to speak. Potentially this would free up memory to load more data in at a time.

    What do you mean by biomarkers? Are we talking transcription factor binding sites? There is an object-oriented programming framework called TFBS (Bioinformatics, vol. 18, no. 8, 2002) that is compatible with BioPerl and would likely make this pattern-matching routine more efficient, as it is designed for DNA code...
    Bioinformatics
      Hi bioinformatics, that is a very interesting idea. Yes, it's like a BLAST, and I was in fact considering taking the 2nd file, perhaps concatenating those sequences a bit, and making a BLASTable database. Then I would megablast the first file against the 2nd -- I have 8 processors -- make the word size the length of the strings, boom.
      No, there are no particular categories of sequences. The first file is sequencing output that could be 25-100 bp. The 2nd file's sequences could be a lot of things, biologically. We are looking for certain "patterns" or "motifs" in the sequences.
        If you are scanning for motifs, you should be able to adapt TFBS to do that, as a binding site is a motif itself. On another note, STORM is another software program that is designed to do such searches, and it can be integrated with a database (Statistical significance of cis-regulatory modules, BMC Bioinformatics 2007, 8:19). I think that officially puts me out of ideas otherwise :)

        Bioinformatics
Re: Comparing strings (exact matches) in LARGE numbers FAST
by gone2015 (Deacon) on Aug 29, 2008 at 19:39 UTC

    As indicated elsewhere, the quick and easy way is to use somebody else's carefully crafted sort program to sort both files and then run them against each other.

    However, you did ask:

    What approach would be really super-fast?

    If the quick and easy approach is more or less fast enough, stop now. If you're not going to do this very often, stop now. Otherwise...

    There are ~2 * 10^19 different 32-character strings. You have ~n * 10^7 of those and ~n * 10^8 you want to match against. This suggests a fairly low hit rate. Yes?

    If so, then we apply rule some-small-integer, and try to hack the problem down to size. Rather than attempt a lot of exact matching, I suggest the trick here is first to use some inexact matching, to eliminate strings for which there is definitely no match.

    The obvious way to do that is to hash the strings into a largish integer and use a bit vector to record which hashes exist in the smaller file. I'd try a 30-bit hash, but 32 might work better (assuming a 512MB bit vector is not a shock to your system).

    The process is then (a rough code sketch follows at the end of this node):

    1. scan the smaller file and populate the bit-vector according to the hash values.
    2. scan the larger file and record any strings whose hash values appear in the bit-vector.
    3. with any luck, there are now a small number of candidate strings -- small enough, that is, to be held in memory, in particular in a hash table.
    4. scan the small file again, and see if anything is found in the hash table.

    If both files are going to be different each time, we're done. Otherwise, you could build and save the bit-vector for the file that doesn't change, and eliminate step (1) in the above process (if the larger file is the unchanging one, the process needs to be adjusted to suit).

    Health warning: If the strings are generally pretty random (in a hand waving sort of a way), you'll probably be OK with a straightforward hash function. If there's any regularity, finding the right hash may be tricky, and one hash may not suit a different set of data with different properties. Anyway, be careful with the hash function and try before you buy !

    Final thought: if there's a really fast CRC-32 module somewhere, I'd try that for the hash. Oh, final final thought: I'd instrument the code so that you can tell how well the hash etc. are working.
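    Here is a rough code sketch of the four-step process above (my own illustration: it assumes the String::CRC32 module for the hash, uses a 30-bit mask as suggested, and the file names are made up):

    use strict;
    use warnings;
    use String::CRC32;                 # assumed CRC-32 module; exports crc32()

    my $BITS = 30;                     # 2**30-bit vector = 128MB
    my $MASK = ( 1 << $BITS ) - 1;

    # step 1: populate the bit-vector from the smaller file
    my $filter = '';
    open my $small, '<', 'small.txt' or die "Can't read small.txt: $!";
    while ( <$small> ) {
        chomp;
        vec( $filter, crc32( $_ ) & $MASK, 1 ) = 1;
    }

    # step 2: collect candidate strings from the larger file
    my %candidates;
    open my $big, '<', 'big.txt' or die "Can't read big.txt: $!";
    while ( <$big> ) {
        chomp;
        $candidates{$_} = 1 if vec( $filter, crc32( $_ ) & $MASK, 1 );
    }
    close $big;

    # steps 3 and 4: re-scan the smaller file against the candidate hash
    seek $small, 0, 0;
    while ( <$small> ) {
        chomp;
        print "$_\n" if $candidates{$_};
    }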

      Thanks, the idea of doing some inexact matching first to reduce the number of exact comparisons needed will definitely save time.
      The larger file will actually be the same from one run to another, so yes, I've thought of storing either the sorted file (if I go with a Unix-like approach like comm or uniq) or the bitvector for that file, so that work is done only once and time is saved in subsequent runs.
      The strings are pretty random.
Re: Comparing strings (exact matches) in LARGE numbers FAST
by johndageek (Hermit) on Sep 02, 2008 at 17:01 UTC
    Just some ideas on how to handle this:
    sort primary file (100s of millions)
    split file into ~10 million record sub-files, named for the last key contained in each file
    sort secondary file -> ssf1
        you may sort multiple secondary files (ssf2, ssf3 etc)

    perl program:
        read directory containing primary files by name
        create an array containing the file names (@pfn) (last key in the array)
        open ssf1 ssf2 ssf3 ssf4
        $ssf1_record = " ";   # or a value lower than lowest value
        $ssf2_record = " ";   # or a value lower than lowest value
        $ssf3_record = " ";   # or a value lower than lowest value
        $ssf4_record = " ";   # or a value lower than lowest value
        foreach $pf (@pfn) {
            open primary file ($pf)
            read $pf records into a hash ($pfrh)
            while ($ssf1_record < $pf) {
                compare to hash
                if found { action }
                read $ssf1 record;
            }
            while ($ssf2_record < $pf) {
                compare to hash
                if found { action }
                read $ssf2 record;
            }
        }

    Enjoy!
    Dageek

Re: Comparing strings (exact matches) in LARGE numbers FAST
by Anonymous Monk on Sep 01, 2008 at 19:41 UTC
    Create a prefix tree for the dictionary that you're trying to match against. Also, add a data element to each node in the tree to mark whether or not it is a valid terminal position.

    Then, for each search instance (i.e., each line in the second input file), search the prefix tree. If you end up on a node in your prefix tree that is also a valid terminal position, then you have a match. Creating the prefix tree takes amortized linear time; searching the prefix tree for each element in a file is bounded in the worst case by n*(k/log(base)) time, where n is the number of elements in the file, k is the length of the longest string that you're matching against, and base is the number of children each node has.

    This approach has the added benefit that once the prefix tree is created it can be reused later. -- jbs36
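
    A minimal sketch of the prefix-tree (trie) idea using nested hashes (illustration only -- holding the whole dictionary this way is memory-hungry, and DICTFILE/SEARCHFILE are made-up handles in the style of the snippets above):

    # Each node is a hash of children keyed by character; the empty-string key
    # is an arbitrary marker for "valid terminal position".
    sub trie_add {
        my ( $trie, $string ) = @_;
        my $node = $trie;
        $node = $node->{$_} ||= {} for split //, $string;
        $node->{''} = 1;
    }

    sub trie_match {
        my ( $trie, $string ) = @_;
        my $node = $trie;
        for my $char ( split //, $string ) {
            $node = $node->{$char} or return 0;
        }
        return exists $node->{''};
    }

    my %trie;
    while ( <DICTFILE> )   { chomp; trie_add( \%trie, $_ ); }
    while ( <SEARCHFILE> ) { chomp; print "$_\n" if trie_match( \%trie, $_ ); }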