Improving speed match arrays with fuzzy logic

Takamoto has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I have a first long list (array) of n-grams (array various length, typical: 10.000 elements) and a second even bigger reference list with n-grams (~500.000 elements). I need to check if elements of list 1 are present in list 2. The match must be fuzzy, i.e. both lists are words, and I need to match also very similar words (basically to match singular-plurals, small differences in spellings, and so on). My script (very naive) works, but I'd like to hear from you some suggestions to improve its speed (which is quite crucial in my application). I already had a BIG gain in time switching from Text::Fuzzy::PP to Text::Fuzzy, but surely the script could further improved. Data looks like this (each line is one element of multi word units):

IP exchange
IPX
Internet Protocol Exchange
DPM
Data Point Model
[download]

My code

use strict;
use warnings;
use Text::Fuzzy;
use Data::Dumper;

my @stringsToMatch=("extra-articular arthrodesis", "malnutritions");#m
+y first list. I want to check if its elements are in my reference lis
+t $referenceNgrams
my @goodmatches;

my $referenceNgrams="EN.txt";#my huge reference file (second list)

print "Loading n-grams from $referenceNgrams\n";
open my $inFH, '<:encoding(UTF-8)', $referenceNgrams or die;
chomp(my @Corpus = <$inFH>);
close $inFH;
        
#Matching with Fuzzy
foreach my $stringToMatch (@stringsToMatch){
    print "Working on $stringToMatch\n";
    foreach my $corpusElement (@Corpus){
        #matching only if $stringToMatch has the same amount of elemen
+ts of $corpusElement (to save time?)
        my $elementsInstringToMatch = 1 + ($stringToMatch =~ tr{ }{ })
+;
        my $elementsIncorpusElement = 1 + ($corpusElement =~ tr{ }{ })
+;
        
        if ($elementsIncorpusElement eq $elementsInstringToMatch){
            my $tf = Text::Fuzzy->new ($stringToMatch);
            my $distance= $tf->distance ($corpusElement);
            if ($distance < 2){#sensibility
                push (@goodmatches, $stringToMatch);
                last;#go out of loop if match has been found
            }
        }
    }
}

print "Good matches:\n";
print Dumper @goodmatches;
[download]

Comment on Improving speed match arrays with fuzzy logic Select or Download Code

Replies are listed 'Best First'.
Re: Improving speed match arrays with fuzzy logic by Your Mother (Archbishop) on Jan 18, 2019 at 18:45 UTC
I expect you’ll be interested in these: UMLS::Similarity, UMLS::Interface, and a general search of the CPAN in the UMLS::* space. It's not trivial to set up but the instruction are good. Unfortunately there are no free medical stemming dictionaries I'm aware of. This really helps the kind of comparison task you're doing. If you know or learn of any, please do share. There are other approximate matching packages. Some like the metaphone stuff are probably too fuzzy for you and others like String::Approx are probably too slow. You might want to index all your data first and use something like Lucy (to do stemming, tokenizing, and various normalization once instead of per search/match); or possibly resort to specialized data containers like Judy.	[reply]
Re^2: Improving speed match arrays with fuzzy logic by Takamoto (Monk) on Jan 19, 2019 at 10:00 UTC
Thank you, Your Mother. I did not know the modules you pointed to! And they open up new horizons to me, as I am mostly interested in natural language-related tasks. I agree, stemming is a good approach, however so much dependent on available resources (language, domain, etc.) that - as for now - its usefulness is limited to general language. At least this is my experience.	[reply]
Re: Improving speed match arrays with fuzzy logic by bliako (Monsignor) on Jan 18, 2019 at 20:44 UTC
Firstly, I would create `$tf` from `my $tf = Text::Fuzzy->new ($stringToMatch);` outside the inner loop. There is no need to re-create that object for the same base word (`$stringToMatch`) over and over each item in the corpus. Since the number of stringsToMatch is much less than Corpus, it is a good thing (for this suggestion) that the loops are arranged the way they are. If you are going to repeat above process for different `@stringsToMatch` but the same `@Corpus` and you have RAM to spare then try creating one `Text::Fuzzy` object for each corpus element ONCE and cache these for each subsequent run. Another option is before calling `$tf->distance()`, which I assume it is expensive, to have a few relatively cheap heuristics of your own to decide whether it is going to be a waste of time to call the expensive distance(). You are already using one by checking on the number of words in each string. If the distance function's performance is worse than linear to the length of strings then maybe break your strings to words and use the mean or max of their distances. Fewer letters => faster even if more words. If the distance algorithm you are using is too expensive then experiment with converting your words to phonetic representation using one of the algorithms in Text::Phonetic. That could accommodate for different spellings at least (possibly). And of course there is the route of AI and neural networks. But I need to dust these off. p.s. With heuristics you must keep statistics of their performance in order to assess them later. bw, bliako	[reply] [d/l] [select]
Re^2: Improving speed match arrays with fuzzy logic by Takamoto (Monk) on Jan 19, 2019 at 10:06 UTC
Very nice suggestions, bliako, to improve my script! AI is for sure the way to go in many areas (not sure here). Unfortunately Perl offers almost nothing in the area (probably the only notable exception is AI::MXNet, but I haven't be able to configure and use it... I should give it another try)	[reply]
Re^3: Improving speed match arrays with fuzzy logic by Anonymous Monk on Jan 19, 2019 at 11:58 UTC
If inflection (plural/singular) differences appear a lot then there is a module which converts a word to plural: Lingua::EN::Inflect. It is very simple to use and not slow. From manpage: `print "same\n" if PL_eq($word1, $word2);`. Phonix (and variations, e.g. Metaphone) are algorithms for collapsing words to a phonetic space (which is simpler for retrieval, e.g. w.r.t. spelling). There are modules available for doing that. It's probably worth applying fuzzy-distance to phonetic space and see if that works faster without loss of accuracy. Apropos MXNet, there is an LSTM module judging from this: https://metacpan.org/source/SKOLYCHEV/AI-MXNet-1.33/examples/char_lstm.pl . If you can make use of something like the procedure described in this: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/ Using MXNet's Perl modules requires you to download MXNet from https://mxnet.apache.org/versions/master/install/index.html?platform=Linux&language=Python&processor=CPU and install it, maybe compile it too.	[reply] [d/l]
Re^4: Improving speed match arrays with fuzzy logic by bliako (Monsignor) on Jan 19, 2019 at 12:00 UTC
Re^5: Improving speed match arrays with fuzzy logic by erzuuli (Cannon) on Jan 19, 2019 at 16:27 UTC
Re: Improving speed match arrays with fuzzy logic by tybalt89 (Monsignor) on Jan 19, 2019 at 13:52 UTC
"... and so on" Since the specification is fuzzy, let's make a fuzzy matching regex :) Then match it against the whole corpus as a single string, instead of doing 500,000 individual matches. #!/usr/bin/perl # https://perlmonks.org/?node_id=1228728 use strict; use warnings; # corpus is now a string instead of an array FIXME for real filename my $corpus = do { local (@ARGV, $/) = '/usr/share/dict/words'; <> }; # fake random input strings FIXME for real strings in @tomatch my @tomatch = map { join '', map { ('a'..'z')[rand 26] } 1 .. 4 } 1 .. + 1e2; for my $string (@tomatch) { my @patterns; # match <2 changes push @patterns, "$`.?$'" while $string =~ /\S/g; # changed or droppe +d char push @patterns, "$`.$'" while $string =~ /\|/g; # added char $string =~ /^(.+)es$/ && push @patterns, $1; # singular my $fuzzyregex = do { local $" = '\|'; qr/^(@patterns)$/m }; $corpus =~ $fuzzyregex && printf "%35s : %s\n", $string, $1; # FIXME + output } [download] Besides, I couldn't pass up an opportunity to write perl to write a regex :)	[reply] [d/l]
Re^2: Improving speed match arrays with fuzzy logic by bliako (Monsignor) on Jan 19, 2019 at 16:55 UTC
Excellent! Since the worms are well out of the can (and pretty fast too!), can I add `s\|z`, `(f\|ph)` and add a plural case (rather than just making a singular)? `... # $string =~ /^(.+)es$/ && push @patterns, $1; # singular # Bliako modified: ($string =~ /^(.+?)(e?s)$/ && push @patterns, $1) # singular \|\| push @patterns, $string.'(e?s)?' ; s/(?<![sz])(?:s\|z)(?=[^\)\.])/(?:s\|z)/ for(@patterns); s/f\|ph/(:?f\|ph)/ for(@patterns); #print "patterns:"; print " /$_/" for(@patterns); print "\n"; # end mods ...` [download]	[reply] [d/l] [select]
Re^3: Improving speed match arrays with fuzzy logic by tybalt89 (Monsignor) on Jan 19, 2019 at 17:52 UTC
With a specification of "and so on" you can add anything you want :)	[reply]
Re: Improving speed match arrays with fuzzy logic by pryrt (Abbot) on Jan 18, 2019 at 18:46 UTC
The match must be fuzzy, ... and I need to match also very similar words ... differences in spellings, and so on... Just be careful not to accidentally target the Missal of Silos. (Sorry, the timing of your post just days after that comic came out was too much not to comment on.)	[reply]
Re^2: Improving speed match arrays with fuzzy logic by Takamoto (Monk) on Jan 19, 2019 at 10:09 UTC
Ahaha, very nice indeed! Always good to be remained of pitfalls. Fuzzy matches are not a panacea, of course, but in some areas (depending on the matching goal) may be a viable solution to generalize the matching task.	[reply]