Re: Quicker Array Searching
by dragonchild (Archbishop) on Oct 29, 2003 at 21:31 UTC
It'll run in O(n^2), which is reasonable for what you're doing. However, you can speed that up some, so it runs in more like O(n log n) for the non-pathological cases.
foreach my $itemIndex (1 .. $items->Count)
{
    # pull the text out from between the MessageBody tags
    my ($body) = $items->Item($itemIndex)->Body =~ /MessageBody>(.*)<\/Mess/s;
    if ($body !~ /\S/)
    {
        die "Some appropriate error on index $itemIndex.";
    }

    CACHE_CHECK:
    foreach my $cachedItem (@cache)
    {
        my $compare = $cachedItem->[0];

        # normalise by the longer of the two lengths
        my $len = length($body) > length($compare)
                ? length($body)
                : length($compare);

        if (distance($body, $compare) / $len < $threshold)
        {
            # mark both items for later handling, then stop looking
            @toBeMoved[$itemIndex, $cachedItem->[1]] = (1, 1);
            last CACHE_CHECK;
        }
    }
}
The primary differences here are:
- I have made @toBeMoved into a kind of bit-flag array. You loop through that list later and deal only with the items marked true.
- Once I find a match, I stop looking.
The second item is the major speed-up. You might not agree with it; if not, get rid of the last. My thinking was that your operation is transitive, i.e. if A & B match and B & C match, then A & C will match. If that's not the case, then change it.
------
We are the carpenters and bricklayers of the Information Age.
The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6
... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms
Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
Re: Quicker Array Searching
by BrowserUk (Patriarch) on Oct 29, 2003 at 22:55 UTC
I have to say that I seriously doubt that Text::Levenshtein is the right tool for this job.
The problems are
- the algorithm is designed for contrasting phrases, two at a time.
The numerical results of comparing A <-> B, and A <-> C, are not themselves comparable in any meaningful fashion.
Even comparing two otherwise identical phrases with different amounts of whitespace between the words, or different case or capitalisation, will result in differing scores, even though the human eye would perceive them as similar.
- The 'score' will be affected by every character: case, whitespace, markup, length, etc.
Simple transpositions of characters in a word, of words in a sentence, or of whole sentences will produce dramatically different scores.
print distance( 'the quick brown fox jumps over the lazy dog',
                'jumps over the lazy dog the quick brown fox jumps' );
# prints 42
There are an infinite number of completely unrelated pairs of sentences -- they could be on a different subject, the same subject using different words, the same words misspelt, or in an entirely different language -- that would produce a "match" for this score of 42.
You might get somewhere by applying the Levenshtein algorithm to words (or their indices in a lookup table) instead of characters, case-folded and with whitespace and markup excluded, though you would still be fooled by simple typos and such.
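A rough, untested sketch of that word-level variant (all helper names are my own): map each distinct word to a single stand-in character, then run the ordinary character-based distance over the mapped strings.

use strict;
use warnings;
use Text::Levenshtein qw(distance);

my %char_for;    # word -> stand-in character

# assign each distinct word a unique single character
sub word_char {
    my ($word) = @_;
    $char_for{$word} = chr( 0x100 + keys %char_for )
        unless exists $char_for{$word};
    return $char_for{$word};
}

# case-fold, strip markup (crudely), squeeze whitespace, map to chars
sub as_word_string {
    my ($text) = @_;
    $text =~ s/<[^>]+>//g;
    return join '', map { word_char($_) } split ' ', lc $text;
}

print distance(
    as_word_string('The quick  brown fox jumps over the lazy dog'),
    as_word_string('the quick brown fox jumped over the LAZY dog'),
), "\n";    # one word differs, so this prints 1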
The Levenshtein algorithm is all about the comparative positions of similar elements. The results of any one comparison are not themselves comparable with any other.
It would be an interesting exercise to devise an algorithm that would not only compare bodies of text for similarity, but also produce a metric for each piece of text that could be compared directly with that from another.
It might work by producing a list of indices for the words (sans whitespace, markup, case etc) in a table. Then use the word indices to look up a prime number.
If 'A' was the first word in your table, its index would be 0, and that would map to the prime 2. If 'an' was the second, that would map to the prime 3, and so on.
You then multiply the prime for each word by the frequency of its occurrence and sum the results.
You end up with a (huge) number that reflects both the words found and their frequencies in a semi-unique way. You've thrown away 'extraneous' information, like case and whitespace, but also relative position. It still doesn't handle typos or foreign languages, and it doesn't in any way reflect the meaning of the text. The results might be interesting though.
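A rough and equally untested sketch of that fingerprint, with the prime lookup done by naive trial division (all names mine):

use strict;
use warnings;

# nth prime, 0-based, by trial division; fine for a sketch
sub nth_prime {
    my ($n) = @_;
    my @primes;
    my $candidate = 1;
    while (@primes <= $n) {
        $candidate++;
        push @primes, $candidate
            unless grep { $candidate % $_ == 0 } @primes;
    }
    return $primes[$n];
}

my %word_index;    # word -> position in the shared lookup table

# sum, over the distinct words, of prime(word) * frequency(word)
sub fingerprint {
    my ($text) = @_;
    $text =~ s/<[^>]+>//g;               # throw markup away
    my @words = split ' ', lc $text;     # case-fold, squeeze whitespace
    my %freq;
    $freq{$_}++ for @words;
    my ($sum, %done) = (0);
    for my $word (@words) {              # first-seen order, so indices
        next if $done{$word}++;          # are stable across a run
        $word_index{$word} = keys %word_index
            unless exists $word_index{$word};
        $sum += nth_prime( $word_index{$word} ) * $freq{$word};
    }
    return $sum;
}

print fingerprint('a man a plan a canal'), "\n";   # prints 21 (2*3 + 3 + 5 + 7)

Identical bags of words always land on the same number, but so can different bags (sums aren't unique the way prime products would be), which is presumably the 'semi unique' caveat above.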
It might be worth looking up expertise and algorithms for plagiarism detection, though again, I think these will tend only to detect similarity between two bodies of text, not produce a value by which more than two bodies can be cross-related, nor scores that can themselves be directly compared.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Hooray!
Re: Quicker Array Searching
by davido (Cardinal) on Oct 29, 2003 at 21:27 UTC
Without looking too deeply at the specifics: if you're constructing an array of items already seen, and checking the next item against every item in the array to make sure you haven't already seen it, there is definitely another way to do it. You might consider using a hash instead.
my @things = qw/homer bart marge maggie lisa kramer
                jerry george elaine marge lisa/;
my %seen;
foreach my $thing ( @things ) {
    next if exists $seen{$thing};
    $seen{$thing}++;
    # Do your stuff here.
}
The above snippet keeps track of whether or not something's been seen already, and if it has, it skips to the next thing. If it hasn't seen it before, it makes a note of that thing, and then lets you do whatever processing you want.
The hash lookup is much faster than grep on an array.
Dave
"If I had my life to live over again, I'd be a plumber." -- Albert Einstein
Re: Quicker Array Searching
by jonadab (Parson) on Oct 29, 2003 at 21:28 UTC
The bulk of your speed problem is inherent to the notion of comparing the text of each message with the text of each other message. If you could redesign your algorithm to analyse each message once and create some sort of numeric index, sort the messages by their index, and then only compare the actual text of ones that are "close", you can speed it up to maybe O(n^2) or perhaps O(n log n) time. A modern system can probably handle that for a few thousand messages. O(n!) time, however, is right out of the question, as you will discover if you try to run it.
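One concrete (if simplistic) candidate for that numeric index is just the length of each normalised body: the Levenshtein distance between two strings can never be less than the difference in their lengths, so after sorting by length you can stop scanning as soon as the length gap alone breaks the threshold. An untested sketch, with invented names and an assumed fractional threshold:

use strict;
use warnings;
use Text::Levenshtein qw(distance);

my $threshold = 0.2;    # assumed: fraction of the longer length
my @bodies    = ('some message text', 'some message texts', 'unrelated');

# sort by length; distance($a, $b) >= abs(length($a) - length($b))
my @sorted = sort { length($a) <=> length($b) } @bodies;

for my $i (0 .. $#sorted) {
    for my $j ($i + 1 .. $#sorted) {
        my $len = length $sorted[$j];    # the longer, since sorted
        # every later body is longer still, so once the length gap
        # alone exceeds the threshold, stop this inner loop early
        last if ($len - length $sorted[$i]) / $len >= $threshold;
        if (distance($sorted[$i], $sorted[$j]) / $len < $threshold) {
            print "similar: '$sorted[$i]' / '$sorted[$j]'\n";
        }
    }
}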
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
Re: Quicker Array Searching
by sauoq (Abbot) on Oct 29, 2003 at 21:30 UTC
As can be seen, this is going to run in O((n-1)!)
I think that's actually O(N^2): comparing every message against every other is N(N-1)/2 comparisons. Although not blazingly fast, that's much, much faster than O(N!).
-sauoq
"My two cents aren't worth a dime.";
Re: Quicker Array Searching
by etcshadow (Priest) on Oct 29, 2003 at 21:42 UTC
Is there a way that you can create some kind of hash of the message, representing the similarity-key (like soundex, for example), and then use that hash-value of the message as the key in an index-hash?
I'm not familiar with Text::Levenshtein, but if it works by returning some kind of hash or "score" for the message, then you can do something like:
use POSIX qw(floor);    # for floor()

foreach my $message (...) {
    ...
    my $score_range = floor($score / $threshold);
    $index{$score_range} ||= [];
    push @{ $index{$score_range} }, $message;
}
Then you need only compare each message against those in the buckets for $score_range-1, $score_range, and $score_range+1 (see the sketch below). I hope that pseudo-logic is followable... and if I messed up the boundary case a little, well, you can deal with that, but you get the idea.
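The lookup side might then look something like this (continuing the sketch above, with the actual comparison left as a comment):

# each message need only be checked against its own bucket and the
# two neighbouring buckets
for my $range (keys %index) {
    for my $nearby ($range - 1 .. $range + 1) {
        next unless exists $index{$nearby};
        # ... pairwise-compare the messages in @{ $index{$range} }
        #     against those in @{ $index{$nearby} } here ...
    }
}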
------------
:Wq
Not an editor command: Wq
Re: Quicker Array Searching
by sauoq (Abbot) on Oct 29, 2003 at 21:53 UTC
It seems you could do a great deal better by altering your approach a bit.
The method would go something like this... You start with a list of emails and compare the first against all the rest, saving the distances. Sort the emails by their distance from that first email. Then compress the list, making further comparisons only between emails with similar distances from the original email.
This is based on the assumption that two emails which have dissimilar distances from a third can't themselves be similar. (I think this is a safe assumption with Text::Levenshtein, but I really don't have any experience with it.) If it is a safe assumption, you can probably approach a linear time solution.
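For what it's worth, the assumption is sound: Levenshtein distance is a true metric, so the triangle inequality gives distance(B, C) >= abs(distance(A, B) - distance(A, C)) for any pivot A; emails whose distances from the pivot differ by more than the cutoff can't themselves be within the cutoff of each other. An untested sketch, with an assumed absolute cutoff and made-up data:

use strict;
use warnings;
use Text::Levenshtein qw(distance);

my $cutoff = 10;    # assumed: max distance that still counts as similar
my @emails = ('first message body', 'first message bodie', 'other text');

# measure everything once against a single pivot, then sort by that
my $pivot   = shift @emails;
my @by_dist = sort { $a->[0] <=> $b->[0] }
              map  { [ distance($pivot, $_), $_ ] } @emails;

for my $i (0 .. $#by_dist) {
    for my $j ($i + 1 .. $#by_dist) {
        # sorted order: once the gap in pivot-distances exceeds the
        # cutoff, no later email can possibly match this one
        last if $by_dist[$j][0] - $by_dist[$i][0] > $cutoff;
        if (distance($by_dist[$i][1], $by_dist[$j][1]) <= $cutoff) {
            print "similar: '$by_dist[$i][1]' / '$by_dist[$j][1]'\n";
        }
    }
}

This is still quadratic in the worst case (everything equidistant from the pivot), but with well-spread distances most inner loops terminate almost immediately, which is where the near-linear behaviour would come from.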
-sauoq
"My two cents aren't worth a dime.";
Re: Quicker Array Searching
by Nkuvu (Priest) on Oct 30, 2003 at 05:51 UTC
is going to run in O((n-1)!) (if my 5 year old math is right)
Wow, your five year old is doing algorithm analysis? Impressive. When I was five I didn't even know what an algorithm was...