Quicksilver has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying out an algorithm that does a fuzzy search for a pattern, which Tachyon originally posted in reply to a bioinformatics question. It currently outputs the line number and the number of misses, and I've got it to print the sentence in which the occurrence appears, but I'm trying to get the matched word as well and I'd be grateful for some help. I'm experimenting with applying some bioinformatics techniques to text analysis (following a conversation with an acquaintance) and will be looking at using stop words, inflections and corpora in due course.
use strict;
use warnings;

my $word     = "scrooge";
my @find     = map { [split //] } $word;   # each pattern as an array of characters
my $find_len = length($word);
my $fuzzy    = 2;                          # maximum number of mismatches allowed

while (my $search = <DATA>) {
    chomp $search;
    $search = [split //, $search];
    # slide a window of $find_len characters along the line
    for my $i ( 0..@$search-$find_len ) {
        FIND:
        for my $find ( @find ) {
            my $misses = 0;
            for my $j ( 0..$find_len-1 ) {
                $misses++ if $search->[$i+$j] ne $find->[$j];
                next FIND if $misses > $fuzzy;
            }
            print "Line $. Match ($misses) at $i, @$search\n";
        }
    }
}

__DATA__
STAVE I: MARLEY'S GHOST
MARLEY was dead: to begin with. There is no doubt whatever about that. The register of his burial was signed by the clergyman, the clerk, the undertaker, and the chief mourner. Scrouge signed it: and Scrooge's name was good upon 'Change, for anything he chose to put his hand to. Old Marley was as dead as a door-nail.
Mind! I don't mean to say that I know, of my own knowledge, what there is particularly dead about a door-nail. I might have been inclined, myself, to regard a coffin-nail as the deadest piece of ironmongery in the trade. But the wisdom of our ancestors is in the simile; and my unhallowed hands shall not disturb it, or the Country's done for. You will therefore permit me to repeat, emphatically, that Marley was as dead as a door-nail

Replies are listed 'Best First'.
Re: Trying to find a word in a fuzzy search algorithm
by Limbic~Region (Chancellor) on Jun 04, 2008 at 14:45 UTC
    Quicksilver,
    First, here is a list of modules that you may find helpful

    If you want to find the "word" that matched using some fuzzy approximation, you first have to define what a "word" is. Then all you do is parse your input text one "word" at a time, apply your measure, and report on match.

    You will find it harder than you think to define a word though if you want anything more than just rudimentary /\b(\w+)\b/. This is the crux of the problem.
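
A minimal sketch of the word-at-a-time approach described above, assuming a "word" is simply a run of \w characters and swapping in Levenshtein edit distance as the measure (the original post counts positional mismatches instead). The names edit_distance and fuzzy_words are made up for illustration and are not from any module.

```perl
use strict;
use warnings;

# Levenshtein edit distance between two strings (two-row dynamic programming).
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my $n = scalar @t;
    my @prev = (0 .. $n);
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. $n) {
            my $cost = $s[$i-1] eq $t[$j-1] ? 0 : 1;
            my $best = $prev[$j-1] + $cost;                      # substitution
            $best = $prev[$j] + 1  if $prev[$j] + 1  < $best;    # deletion
            $best = $cur[$j-1] + 1 if $cur[$j-1] + 1 < $best;    # insertion
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Return [word, offset, distance] for every "word" within $max edits of $target.
# ASSUMPTION: a word is /\w+/, so "Scrooge's" is seen as "Scrooge" and "s".
sub fuzzy_words {
    my ($line, $target, $max) = @_;
    my @hits;
    while ($line =~ /(\w+)/g) {
        my $word = $1;
        my $dist = edit_distance(lc $word, lc $target);
        push @hits, [$word, $-[1], $dist] if $dist <= $max;
    }
    return @hits;
}

for my $hit (fuzzy_words("Scrouge signed it: and Scrooge's name was good", "scrooge", 2)) {
    printf "%s at offset %d (%d edits)\n", @$hit;
}
```

On that sample line this should report both "Scrouge" and "Scrooge". Changing the regex in fuzzy_words is where a different definition of a word would go.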

    Cheers - L~R

      Hi L-R,
      I had started using these but hadn't quite got the results I wanted, i.e. I could get sentences and so on, but not necessarily the word or its variant. I'll take another look, though, and see if I can get a tighter result. At one level, I also wanted to play around with algorithms a little more and familiarise myself with some of them. As you rightly say, what is a word? I was mainly looking for a simpler search that lets a user look across a series of texts and returns slight variations (if any exist) as a first step, before getting into stop words, inflections et al. Thanks.

      Update: I also want to give users the ability to search for a particular pattern or cluster of letters if they wish, hence trying to adapt this algorithm.
Re: Trying to find a word in a fuzzy search algorithm
by moritz (Cardinal) on Jun 04, 2008 at 15:37 UTC
    I don't know if this helps you, but Perl 5.10.0 has pluggable regex engines. TRE is a regex engine that does fuzzy matching, and avar (iirc) packaged it up in re::engine::TRE.

    I don't know if that package exposes TRE's fuzzy matching, but it shouldn't be too hard to add if it doesn't.

    I've been meaning to play with it for quite a while now, but haven't found the time and motivation so far ;-)

Re: Trying to find a word in a fuzzy search algorithm
by Crackers2 (Parson) on Jun 04, 2008 at 18:06 UTC

    To answer your immediate question:

    print "Line $. Match ($misses) at $i, @$search [[";
    print @$search[$i..$i+$find_len-1];
    print "]]\n";
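
The slice above prints only the $find_len characters that matched. To get the whole word around the match, one option is to widen the window outwards until it hits a non-word character. The following is only a sketch: enclosing_word is a hypothetical helper, and treating letters, digits, underscores and apostrophes as word characters is an assumption you may want to change.

```perl
use strict;
use warnings;

# Hypothetical helper: given a line, a match offset and a match length,
# widen the window to the nearest non-word characters on both sides.
# ASSUMPTION: word characters are [\w'], so "Scrooge's" counts as one word.
sub enclosing_word {
    my ($line, $start, $len) = @_;
    my $end = $start + $len - 1;
    $start-- while $start > 0
        && substr($line, $start - 1, 1) =~ /[\w']/;
    $end++ while $end < length($line) - 1
        && substr($line, $end + 1, 1) =~ /[\w']/;
    return substr($line, $start, $end - $start + 1);
}

print enclosing_word("Scrouge signed it: and Scrooge's name", 23, 7), "\n";
```

Inside the original loop this could be called as enclosing_word(join('', @$search), $i, $find_len). Note that if the fuzzy window itself straddles a space, the helper returns every word the window touches, so you may want to skip windows that contain whitespace.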