Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a script right now that has to search through an email inbox with the intent of moving out emails that are close enough to one another (using Text::Levenshtein). Right now I'm keeping one array that keeps track of emails we've examined so far and have found no matches for, and as I go along I compare each new email in the INBOX to each item in the array, either marking it to be moved if found or pushing it onto the array if not.

As can be seen, this is going to run in O((n-1)!) (if my 5 year old math is right) which can get a bit bogged down since we're talking about very large numbers of emails here (on the order of 7000). Anyone have any ideas on ways to make this more efficient? I know hashes would search faster than an array, but the text of the emails isn't going to be exactly the same, just very similar.

for my $itemIndex (1..$items->Count) {
    my $didWeMove = 0;
    (my $body) = ($items->Item($itemIndex)->Body =~ /MessageBody>(.*)<\/Mess/s);
    if ($body eq " ") {
        die "\nError: Found a message not in the proper format. (No <MessageBody> tag)";
    }
    foreach $letterArray (@letterCache) {
        my $letter = @$letterArray[0];
        my $len = (length($body) > length($letter)) ? length($body) : length($letter);
        if (distance($body, $letter)/$len < $threshold) {
            push(@toBeMoved, $itemIndex);       # If we found one, push the index onto the list...
            push(@toBeMoved, @$letterArray[1]); # ...and push the original onto the list...
            $didWeMove = 1;                     # And set this to true.
        }
    }
    if (!$didWeMove) { # If we did not find a match, this is a unique email (so far)
        push(@letterCache, [$body, $itemIndex]);
    }
}

Any help?

Thanks,
Ross

Replies are listed 'Best First'.
Re: Quicker Array Searching
by dragonchild (Archbishop) on Oct 29, 2003 at 21:31 UTC
    It'll run in O(n^2), which is reasonable for what you're doing. However, you can speed that up somewhat, so that it runs closer to O(n log n) for the non-pathological cases.
    foreach my $itemIndex (1 .. $items->Count) {
        my ($body) = $items->Item($itemIndex)->Body =~ /MessageBody>(.*)<\/Mess/s;
        if ($body !~ /\S/) {
            die "Some appropriate error on index $itemIndex.";
        }

        CACHE_CHECK:
        foreach my $cachedItem (@cache) {
            my $compare = $cachedItem->[0];
            my $len = length($body) > length($compare)
                    ? length($body)
                    : length($compare);
            if (distance($body, $compare) / $len < $threshold) {
                @toBeMoved[$itemIndex, $cachedItem->[1]] = (1, 1);
                last CACHE_CHECK;
            }
        }

        # Unmatched messages still need caching for later comparisons.
        push @cache, [$body, $itemIndex] unless $toBeMoved[$itemIndex];
    }

    The primary differences here are:

    1. I have made @toBeMoved into a kind of bit-flag. You loop through that list later and deal only with the items marked true.
    2. Once I find a match, I stop looking.

    The second item is the major speed-up. You might not agree with it. If not, then get rid of the last. My thought was that your operation is transitive; that is, if A & B match and B & C match, then A & C will match. If that's not the case, then change it.

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    ... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Quicker Array Searching
by BrowserUk (Patriarch) on Oct 29, 2003 at 22:55 UTC

    I have to say that I seriously doubt that Text::Levenshtein is the right tool for this job.

    The problems are

    • the algorithm is designed for contrasting phrases, two at a time.

      The numerical results of comparing A <-> B, and A <-> C, are not themselves comparable in any meaningful fashion. Even comparing two identical phrases with different amounts of whitespace between the words, or differing case or capitalisation, will produce differing scores, even though the human eye would perceive them as similar.

    • The 'score' will be affected by every character: case, whitespace, markup, length, etc.

      Simple transpositions of characters in a word, of words in a sentence, or of whole sentences will produce dramatically different scores.

      print distance( 'the quick brown fox jumps over the lazy dog',
                      'jumps over the lazy dog the quick brown fox jumps' ); # prints '42'

      There are an infinite number of completely unrelated pairs of sentences -- they could be on a different subject, the same subject using different words, the same words misspelt, or in an entirely different language -- that would produce a "match" for this score of 42.

    You might get somewhere by applying the Levenshtein algorithm, but using the words (or their index in a lookup table) instead of characters, case-folded, excluding whitespace and markup, but you would still be fooled by simple typos and such.
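    A word-level sketch of that idea might look like the following (untested; the `word_distance` helper and the sample strings are my own invention, not part of Text::Levenshtein, which works on characters):

```perl
use strict;
use warnings;

# Standard dynamic-programming Levenshtein recurrence, but over
# case-folded, whitespace-split word tokens instead of characters.
sub word_distance {
    my ($a_text, $b_text) = @_;
    my @a = split ' ', lc $a_text;
    my @b = split ' ', lc $b_text;

    my @prev = (0 .. @b);            # row for the empty prefix of @a
    for my $i (1 .. @a) {
        my @cur = ($i);
        for my $j (1 .. @b) {
            my $cost = $a[$i - 1] eq $b[$j - 1] ? 0 : 1;
            $cur[$j] = (sort { $a <=> $b } $prev[$j] + 1,          # deletion
                                           $cur[$j - 1] + 1,       # insertion
                                           $prev[$j - 1] + $cost   # substitution
                       )[0];
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print word_distance('The quick brown Fox', 'the quick brown dog'), "\n"; # 1
```

    Whitespace and case differences now cost nothing, but as noted above, a single typo still makes two words count as entirely different.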

    The Levenshtein algorithm is all about the comparative positions of similar elements. The results of any one comparison are not themselves comparable with any other.

    It would be an interesting exercise to devise an algorithm that would not only compare bodies of text for similarity, but also produce a metric for each piece of text that could be compared directly with that from another.

    It might work by producing a list of indices for the words (sans whitespace, markup, case etc) in a table. Then use the word indices to look up a prime number.

    If 'A' was the first word in your table, its index would be 0, and that would then map to the prime 2. If 'an' was the second, that would be mapped to the prime 3, and so on.

    You then multiply the prime for each word by the frequency of its occurrence and sum the results.

    You end up with a (huge) number that reflects both the words found and their frequencies in a semi-unique way. You've thrown away 'extraneous' information, like case and whitespace, but also relative position. It still doesn't handle typos or foreign languages, and it doesn't in any way reflect the meaning of the text. The results might be interesting though.
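    As a rough sketch of that scheme (my own untested illustration; the fixed prime table and the `fingerprint` name are assumptions, and a real version would generate primes on demand):

```perl
use strict;
use warnings;

# A small fixed prime table; a real version would extend this on demand.
my @primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47);
my %prime_for;                 # word => the prime assigned to it
my $next_prime = 0;

# Fingerprint = sum over distinct words of (word's prime * frequency),
# after case-folding and stripping everything but word characters.
sub fingerprint {
    my ($text) = @_;
    my %freq;
    $freq{$_}++ for grep { length } split /\W+/, lc $text;

    my $sum = 0;
    for my $word (keys %freq) {
        $prime_for{$word} = $primes[$next_prime++]
            unless exists $prime_for{$word};
        $sum += $prime_for{$word} * $freq{$word};
    }
    return $sum;
}

# Word order, case, whitespace and markup are all ignored:
print fingerprint('The cat sat'), ' ', fingerprint('sat, the CAT!'), "\n"; # 10 10
```

    As noted, the sum is only semi-unique: different word sets can collide on the same value, so this is a cheap pre-filter rather than a match test.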

    It might be worth looking up expertise and algorithms for plagiarism detection, though again, I think that these will tend only to detect similarity between two bodies of text, not produce a value by which more than two bodies can be cross-related, nor scores that can themselves be directly compared.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!

Re: Quicker Array Searching
by davido (Cardinal) on Oct 29, 2003 at 21:27 UTC
    Without looking too deeply at the specifics: if you're constructing an array of items already seen, and checking the next item against every item in the array to make sure you haven't already seen it, there is definitely another way to do it. You might consider using a hash instead.

    my @things = qw/homer bart marge maggie lisa kramer jerry george elaine marge lisa/;
    my %seen;
    foreach my $thing ( @things ) {
        next if exists $seen{$thing};
        $seen{$thing}++;
        # Do your stuff here.
    }

    The above snippet keeps track of whether or not something's been seen already, and if it has, it skips to the next thing. If it hasn't seen it before, it makes a note of that thing, and then lets you do whatever processing you want.

    The hash lookup is much faster than grep on an array.


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
Re: Quicker Array Searching
by jonadab (Parson) on Oct 29, 2003 at 21:28 UTC

    The bulk of your speed problem is inherent in the notion of comparing the text of each message with the text of every other message. If you could redesign your algorithm to analyse each message once, create some sort of numeric index, sort the messages by that index, and then only compare the actual text of ones that are "close", you could speed it up to maybe O(n^2) or perhaps O(n log n) time. A modern system can probably handle that for a few thousand messages. O(n!) time, however, is right out of the question, as you will discover if you try to run it.
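    One cheap way to sketch that (my own untested illustration: message length as the numeric index, and a minimal `edit_distance` standing in for Text::Levenshtein's distance() so the example is self-contained) relies on the fact that edit distance is at least the difference in lengths:

```perl
use strict;
use warnings;

# Minimal character-level Levenshtein, standing in for
# Text::Levenshtein::distance().
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            $cur[$j] = (sort { $a <=> $b } $prev[$j] + 1,
                                           $cur[$j - 1] + 1,
                                           $prev[$j - 1] + $cost)[0];
        }
        @prev = @cur;
    }
    return $prev[-1];
}

my $threshold = 0.2;
my @bodies = ('hello world', 'hello worlds',
              'a completely different message body goes here');

# Length is a one-pass numeric index: edit distance is at least the
# difference in lengths, so after sorting by length we can stop
# scanning as soon as the length gap alone exceeds the threshold.
my @sorted = sort { length($a->[0]) <=> length($b->[0]) }
             map  { [ $bodies[$_], $_ ] } 0 .. $#bodies;

my %to_move;
for my $i (0 .. $#sorted) {
    for my $j ($i + 1 .. $#sorted) {
        my $len = length $sorted[$j][0];          # the longer of the pair
        last if ($len - length $sorted[$i][0]) / $len >= $threshold;
        if (edit_distance($sorted[$i][0], $sorted[$j][0]) / $len < $threshold) {
            @to_move{ $sorted[$i][1], $sorted[$j][1] } = (1, 1);
        }
    }
}
print join(',', sort { $a <=> $b } keys %to_move), "\n"; # 0,1
```

    The worst case is still quadratic when everything is the same length, but for a typical inbox most pairs never reach the expensive distance call.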


    $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
Re: Quicker Array Searching
by sauoq (Abbot) on Oct 29, 2003 at 21:30 UTC
    As can be seen, this is going to run in O((n-1)!)

    I think that's actually O(N^2) which, although not blazingly fast, is much, much faster than O(N!).

    -sauoq
    "My two cents aren't worth a dime.";
    
      I think that's actually O(N^2)

      You may be right; I wasn't careful with my analysis of the speed of the algorithm, other than in terms of how to improve it.

      which, although not blazingly fast, is much, much faster than O(N!)

      No joking.


      $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
        No joking.

        Heh. Never mind 7000 messages... 17 would be intractable. :-)

        -sauoq
        "My two cents aren't worth a dime.";
        
Re: Quicker Array Searching
by etcshadow (Priest) on Oct 29, 2003 at 21:42 UTC
    Is there a way that you can create some kind of hash of the message, representing the similarity-key (like soundex, for example), and then use that hash-value of the message as the key in an index-hash?

    I'm not familiar with Text::Levenshtein, but if it can work by returning some kind of a hash or a "score" of the message, then you can do something like:

    foreach my $message (...) {
        ...
        my $score_range = floor($score / $threshold);
        $index{$score_range} ||= [];
        push @{ $index{$score_range} }, $message;
    }
    And then you need only compare against those messages in the hash buckets for $score_range-1, $score_range, and $score_range+1. I hope that pseudo-logic is followable... and if I messed up the boundary case a little, well, you can deal with that, but you get the idea.
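    A runnable sketch of that bucketing (my own untested illustration; the `score` function is a hypothetical stand-in, here just the message length):

```perl
use strict;
use warnings;
use POSIX qw(floor);

# score() is a hypothetical stand-in: any one-pass function that gives
# similar messages nearby values would do. Here it is just the length.
sub score { length $_[0] }

my $threshold = 5;     # bucket width, in score units
my %index;             # bucket number => list of messages in that bucket

for my $message ('aaaa', 'aaab', 'z' x 16) {
    push @{ $index{ floor(score($message) / $threshold) } }, $message;
}

# A new message only needs comparing against its own bucket and the
# two neighbouring buckets, not against every message seen so far.
my $new        = 'aaac';
my $new_bucket = floor(score($new) / $threshold);
my @candidates = map { @{ $index{$_} || [] } }
                 $new_bucket - 1, $new_bucket, $new_bucket + 1;
print scalar @candidates, " candidate(s)\n"; # 2 candidate(s)
```

    The quality of the pruning depends entirely on how well score() clusters genuinely similar messages into neighbouring buckets.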

    ------------
    :Wq
    Not an editor command: Wq
Re: Quicker Array Searching
by sauoq (Abbot) on Oct 29, 2003 at 21:53 UTC

    It seems you could do a great deal better by altering your approach a bit.

    The method would go something like this... You start with a list of emails and compare the first against all the rest, saving the list of distances. Sort your emails by their distance from that 'first' email. Then you effectively compress the list, making further comparisons only between emails with similar distances from the original email.

    This is based on the assumption that two emails which have dissimilar distances from a third can't themselves be similar. (I think this is a safe assumption with Text::Levenshtein, but I really don't have any experience with it.) If it is a safe assumption, you can probably approach a linear time solution.
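    Since Levenshtein distance is a true metric, the triangle inequality |d(A,P) - d(B,P)| <= d(A,B) justifies the pruning: two messages whose distances from a fixed pivot differ by more than the cutoff cannot themselves be within the cutoff. A sketch of the idea (my own untested illustration, using an absolute distance cutoff rather than the OP's ratio threshold, and a minimal `edit_distance` standing in for Text::Levenshtein):

```perl
use strict;
use warnings;

# Minimal Levenshtein, standing in for Text::Levenshtein::distance().
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            $cur[$j] = (sort { $a <=> $b } $prev[$j] + 1,
                                           $cur[$j - 1] + 1,
                                           $prev[$j - 1] + $cost)[0];
        }
        @prev = @cur;
    }
    return $prev[-1];
}

my @msgs   = ('abcd', 'abce', 'wxyz');
my $cutoff = 2;                # hypothetical absolute distance cutoff

# Sort by distance from one pivot message; then only compare pairs
# whose pivot-distances are within the cutoff of each other.
my $pivot   = $msgs[0];
my @by_dist = sort { $a->[1] <=> $b->[1] }
              map  { [ $_, edit_distance($_, $pivot) ] } @msgs;

my @pairs;
for my $i (0 .. $#by_dist) {
    for my $j ($i + 1 .. $#by_dist) {
        last if $by_dist[$j][1] - $by_dist[$i][1] > $cutoff;   # prune
        push @pairs, [ $by_dist[$i][0], $by_dist[$j][0] ]
            if edit_distance($by_dist[$i][0], $by_dist[$j][0]) <= $cutoff;
    }
}
print scalar @pairs, " close pair(s)\n"; # 1 close pair(s)
```

    How close this gets to linear time depends on how spread out the pivot distances are; a pathological inbox where every message is roughly equidistant from the pivot degrades back to quadratic.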

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Quicker Array Searching
by Nkuvu (Priest) on Oct 30, 2003 at 05:51 UTC
    is going to run in O((n-1)!) (if my 5 year old math is right)
    Wow, your five year old is doing algorithm analysis? Impressive. When I was five I didn't even know what an algorithm was...