matthewsnape has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am currently working on a script that compares every cell in a table (about 300) with a list of strings (5000). Initially I just compared each cell with every possible string in the list. Of course this was very time consuming.

My latest idea has been to split the list into a series of arrays for each letter in the alphabet and then only compare strings with corresponding first letters. However, this seems over complicated. What is the best way of comparing one list of strings with another?

  • Comment on comparing a string with lots of other strings

Replies are listed 'Best First'.
Re: comparing a string with lots of other strings
by davido (Cardinal) on Jan 13, 2005 at 00:29 UTC

    A good hashing algorithm works a little like you're considering implementing by hand. Perl's hashing system uses "buckets". There may be a few or many items in each bucket. And each bucket itself is found quickly by the hashing algorithm. The result is a very efficient method where the correct bucket is quickly found, and then its contents are scanned for the particular item needed (this is a rough approximation of reality). Anyway, in the end, you get approximately O(1) lookups, which means constant time regardless of number of items in the lookup list.

    The moral of this story is to use a hash to hold your 5000 strings. If they won't fit in memory that way, it's time to use a database solution and query for the needed string. But for maximum efficiency, just use the hash.


    Dave

      Or, if the 5,000 strings won't fit in memory use your hash to store the 300 table cells and loop over the 5,000 strings instead.
        ...or using a tied hash. ;)

        thor

        Feel the white light, the light within
        Be your own disciple, fan the sparks of will
        For all of us waiting, your kingdom will come

Re: comparing a string with lots of other strings
by borisz (Canon) on Jan 12, 2005 at 23:37 UTC
    use a hash if you have enough memory.
    my @list1; my @list2; my %t; @t{@list2} = (); for ( @list1 ) { print "$_ found" if exists $t{$_}; }
    Boris

      I realise that the original question was comparing 300 strings to 5000. I'd like to just point out that using a hash like this is something I do even when comparing a handful of strings to, say, 50 strings. Much like The Force, hashes are useful for small and large problems. Doing a direct compare is only a good idea if one of the two lists to compare is a list of one. For lists of 2-5, it may be ok, but anything larger, I'd use the hash.

Re: comparing a string with lots of other strings
by Anonymous Monk on Jan 13, 2005 at 00:28 UTC

    When you have many records to check in the way you are describing, perhaps is time to consider a database. What you are asking could be done very easily with a DBMS.

    Checking for strings in both lists (INNER JOIN) or for strings in the first list without corresponding items in the second list (LEFT OUTER JOIN) are easy tasks for a DBMS.
    Once you have paid the price of inserting the lists in two database tables and setting up the appropriate indexes, you have a dedicated server that can solve this kind of problems much better than any implementation you or me can write in any language.

    If you don't want to commit to a complex client/server database, you may give a thought to DBD::SQLite.