in reply to Ignoring spelling mistakes

How many files? How big? How many strings to look for? What time pressure (ie. once a year; once a week; once an hour)?

Fuzzy matching is a notoriously slow process. What approaches are likely to work best very much depends upon the scale of the problem space.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"I'd rather go naked than blow up my ass"

Replies are listed 'Best First'.
Re^2: Ignoring spelling mistakes
by manna45 (Novice) on Feb 05, 2010 at 04:25 UTC
    The files are excel files, approximately 5000-10000 lines, and all the rows have lines that I need to look for. There are 2-3 files to be checked once or twice a month but the checking process shouldn't be too slow. I don't want my boss sitting and drumming his fingers and cursing me while the check is in progress:)
      approximately 5000-10000 lines ... 2-3 files

      So, is that 15,000 to 30,000 "phrases" to be compared against each other, (225 million to 900 million comparisons), once or twice a month?

      How long is a "line"? Does each line consist of a single phrase for comparison? Or phrases wrapped across lines?

      You really need to be a lot more precise in your estimates of your problem if your going to stand a chance of doing it in a timely manner. That said.

      I don't want my boss sitting and drumming his fingers and cursing me while the check is in progress:)

      If this is only once or twice per month, can't he set it going (or schedule it) overnight so that he can just look at the results in the morning?


      To give you a feel for what you are up against, I took the text of a longish novel--1.5 MB 1/4 million words--and broke it up (crudely) into 24,685 sentences. I lowercased them, stripped all puctuation and numerics and normalised the whitespace. I then ran a straight string compare ($string[ $a ] eq $string[ $b ]):

      C:\test>821344 821344.dat 24685 Took 66.675 46546 : 304711641

      It took just over a minute to find 46,000 matches amongst the 304 million comparisons. So, an average rate of 4.5 million comparisons per second.

      I then substituted String::Approx::amatch() for eq and ran it on the first 1000 & 2000 sentences using the default settings:

      C:\test>821344 -N=1000 821344.dat 1000 Took 6.598 4732 : 501501 C:\test>821344 -N=2000 821344.dat 2000 Took 25.470 22044 : 2003001

      which shows an average rate of ~ 77,000 per second.

      Multiply that out and you get 1hr 10 minutes for the whole file.

      Not too bad, but what will it tell you? He are a few examples of "matches" from the first screenful of non-exact matches:

      come up kinch come out kinch he thinks youre not a gentleman hes stinking with money and thinks youre not a gentleman god or leave it there all day forgotten friendship god so i carried the boat of incense then at clongowes i a sail veering about the blank bay waiting for a swollen bundle to bob + up roll over to the sun a puffy face saltwhite why they bundled their books away pencils clacking pages rustling buck mulligan said buck mulligan sighed tragically and laid his hand on stephens arm a servant too i am a servant of two masters stephen said an english and an italian

      As you can see, it produces a lot of false hits. ~4000 from the first 1000 lines. That's going to be a huge amount to wade through.

      You can maybe improve that (and speed things up somewhat), by excluding any short sentences. If you limit it to 4 words or greater, the first 1000 lines finds only 14 non-exact matches:

      C:\test>821344 -N=1000 -D=1 -MIN=4 821344.dat Use of implicit split to @_ is deprecated at C:\test\821344.pl line 22 +. the mockery of it the mockery of it he said contentedly he thinks youre not a gentleman hes stinking with money and thinks youre not a gentleman the cracked lookingglass of a servant cracked lookingglass of a servant the cracked lookingglass of a servant that one about the cracked lookingglass of a servant being the symbol +of irish art is deuced good cracked lookingglass of a servant that one about the cracked lookingglass of a servant being the symbol +of irish art is deuced good i dont remember anything i cant remember anything irish buck mulligan said hes english buck mulligan said and he thinks we ought to speak irish i +n ireland irish buck mulligan said our swim first buck mulligan said irish buck mulligan said give us that key kinch buck mulligan said to keep my chemise flat haines said to her pay up and look pleasant haines said to him smiling he turned to stephen and said he turned to stephen and asked blandly do i contradict myself very well then i contradict myself it seems history is to blame his seacold eyes looked on the empty bay it seems history is to blame +on me and on my words unhating lal the ral the ra lal the ral the raddy

      But I don't know how many of those you would consider positives?

      The real problem here is that String::Approx (and most of the other similar modules), compare strings! That is just sequences of characters without considering any structure they contain--like words.

      For many purposes, it would be better to perform an "edit distance" algorithm that operates on whole words, rather than characters. For example,

      the quick brown fox jumps over the lazy dog the quack brown fix jumps over the lazy dog

      would give a similarity rating of 77%, because 7 out of 9 words matched.

      But even then, it doesn't tell the whole story. Consider:

      the quick brown fox jumps over the lazy dog the lazy dog jumps over the quick brown fox

      A word based edit distance might rate these as substantially the same, but semantically, they are quite different.

      Conversely:

      the quick brown fox jumps over the lazy dog the lazy dog falls over the quick brown fox

      might get rated less favourably, though semantically they are much closer.

      You really need to have a much clearer idea of what it is you are trying to do. Or if you have it clear in your own head, you need to express it much more clearly.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.