I have a database of about 150,000 string literals per day, and I need to group the similar ones into clusters.
I'm currently using the String::Similarity Perl module. It works fine, but I have some performance problems: with a 5,000-literal collection the system takes about 3 minutes.
Update: Consider that I have 5,000 news headlines and I want to group them by similarity; I actually want to cluster all headlines that are at least 80% similar. To speed up the process, I don't compare two headlines if their lengths differ by more than 20%.
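That length filter can be made explicit as a cheap prefilter that runs before the expensive similarity call. A minimal sketch in Python, using difflib's SequenceMatcher.ratio() (a normalized edit similarity of the same 2*M/T form that String::Similarity returns) as a stand-in; the function name and the exact interpretation of the 20% rule are my assumptions:

```python
from difflib import SequenceMatcher

def similar_enough(a, b, threshold=0.8, max_len_diff=0.2):
    # Cheap prefilter: if the shorter string is less than 80% of the
    # longer one's length, the lengths differ by more than 20%, so
    # skip the expensive comparison entirely.
    la, lb = len(a), len(b)
    if min(la, lb) < (1 - max_len_diff) * max(la, lb):
        return False
    # SequenceMatcher.ratio() = 2*M/T, comparable to the value
    # String::Similarity's similarity() computes.
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

The prefilter is O(1), so it costs nothing on the pairs it rejects, which is where the savings come from.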
My code looks like:

    my $docsProcessed;
    my $size = scalar(@{$arrayDocs});
    for (my $i = 0; $i < ($size - 1); $i++) {
        # next if already processed
        next if (defined $docsProcessed->{$arrayDocs->[$i]});
        # note: the bound must be $size, not ($size - 1),
        # or the last document is never compared
        for (my $j = $i + 1; $j < $size; $j++) {
            next if (defined $docsProcessed->{$arrayDocs->[$j]});
            my $similarity = similarity($arrayDocs->[$i], $arrayDocs->[$j], $self->{THRESHOLD});
            if ($similarity >= $self->{THRESHOLD}) {
                # Add the processed document into the cluster
                push(@{$clusters->{$arrayDocs->[$i]}}, $arrayDocs->[$j]);
                # Add the document to the processed hash
                $docsProcessed->{$arrayDocs->[$j]} = 1;
            }
        }
    }
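The loop above is a greedy single pass: each unclustered document seeds a cluster and absorbs every later document similar enough to it. One way to cut the number of comparisons is to sort the documents by length first, so the inner loop can stop as soon as a candidate falls outside the 20% length window. A sketch of that idea in Python (difflib again stands in for String::Similarity; the function name and window arithmetic are my assumptions):

```python
from difflib import SequenceMatcher

def cluster_headlines(docs, threshold=0.8, max_len_diff=0.2):
    """Greedy single-pass clustering over length-sorted documents.

    Because docs is sorted ascending by length, once a candidate is
    more than 20% longer than the seed, every remaining candidate is
    too, so the inner loop can break instead of scanning to the end.
    """
    docs = sorted(docs, key=len)
    processed = set()
    clusters = {}
    for i, seed in enumerate(docs):
        if seed in processed:
            continue
        clusters[seed] = []
        for cand in docs[i + 1:]:
            if cand in processed:
                continue
            if len(cand) > len(seed) / (1 - max_len_diff):
                break  # all later candidates are at least this long
            if SequenceMatcher(None, seed, cand).ratio() >= threshold:
                clusters[seed].append(cand)
                processed.add(cand)
    return clusters
```

The worst case is still O(n^2) similarity calls, but on real headline data the length window prunes most pairs before any expensive comparison runs.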
I just wanted to know if there is another method, other modules, or anything else to speed up this process as much as possible.
Thank you very much.

In reply to Fast string similarity method by icanwin