in reply to Ignoring spelling mistakes
How many files? How big? How many strings to look for? What time pressure (i.e., once a year; once a week; once an hour)?
Fuzzy matching is a notoriously slow process. What approaches are likely to work best very much depends upon the scale of the problem space.
Re^2: Ignoring spelling mistakes
by manna45 (Novice) on Feb 05, 2010 at 04:25 UTC
Re^3: Ignoring spelling mistakes
by BrowserUk (Patriarch) on Feb 05, 2010 at 12:51 UTC
> approximately 5000-10000 lines ... 2-3 files

So, is that 15,000 to 30,000 "phrases" to be compared against each other (225 million to 900 million comparisons), once or twice a month? How long is a "line"? Does each line consist of a single phrase for comparison? Or are phrases wrapped across lines? You really need to be a lot more precise in your estimates of your problem if you're going to stand a chance of doing it in a timely manner.

That said:

> I don't want my boss sitting and drumming his fingers and cursing me while the check is in progress :)

If this is only once or twice per month, can't he set it going (or schedule it) overnight, so that he can just look at the results in the morning?

To give you a feel for what you are up against, I took the text of a longish novel--1.5 MB, 1/4 million words--and broke it up (crudely) into 24,685 sentences. I lowercased them, stripped all punctuation and numerics, and normalised the whitespace. I then ran a straight string compare ($string[ $a ] eq $string[ $b ]):
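The original code block is not reproduced here; a minimal sketch of an all-pairs exact compare in that style (the file name sentences.txt and the @string array name are assumptions) might be:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Load the normalised sentences, one per line.
open my $fh, '<', 'sentences.txt' or die "open: $!";
chomp( my @string = <$fh> );
close $fh;

# Compare every pair once: N*(N-1)/2 comparisons for N sentences
# (24,685 sentences gives the ~304 million comparisons mentioned below).
my $matches = 0;
for my $a ( 0 .. $#string - 1 ) {
    for my $b ( $a + 1 .. $#string ) {
        ++$matches if $string[ $a ] eq $string[ $b ];
    }
}
print "$matches exact matches\n";
```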
It took just over a minute to find 46,000 matches amongst the 304 million comparisons--an average rate of about 4.5 million comparisons per second. I then substituted String::Approx::amatch() for eq and ran it on the first 1000 and 2000 sentences, using the default settings:
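Again the posted code is missing; the substitution would look roughly like this (a sketch only; @string is the array loaded in the previous sketch, and the 1000-sentence cut-off matches the run described above):

```perl
use String::Approx 'amatch';

my $limit   = 1000;    # first 1000 sentences, as in the timing above
my $matches = 0;
for my $a ( 0 .. $limit - 2 ) {
    for my $b ( $a + 1 .. $limit - 1 ) {
        # amatch( $pattern, @inputs ) returns the inputs that match the
        # pattern approximately (default settings), so this counts the
        # pairs that "fuzzy match".
        ++$matches if amatch( $string[ $a ], $string[ $b ] );
    }
}
print "$matches approximate matches\n";
```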
which shows an average rate of ~77,000 comparisons per second. Multiply that out and you get about 1 hour 10 minutes for the whole file. Not too bad, but what will it tell you? Here are a few examples of "matches" from the first screenful of non-exact matches:
As you can see, it produces a lot of false hits: ~4,000 from the first 1000 lines. That's going to be a huge amount to wade through. You can maybe improve that (and speed things up somewhat) by excluding any short sentences. If you limit it to sentences of 4 words or greater, the first 1000 lines find only 14 non-exact matches:
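A sketch of that length filter (not from the original post; @string as before):

```perl
# Keep only sentences of 4 or more words before the pairwise pass.
my @filtered = grep { my @w = split ' ', $_; @w >= 4 } @string;
```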
But I don't know how many of those you would consider positives. The real problem here is that String::Approx (and most of the other similar modules) compares strings! That is, just sequences of characters, without considering any structure they contain--like words. For many purposes, it would be better to use an "edit distance" algorithm that operates on whole words, rather than characters. For example,
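an illustrative pair (not the post's original example, which is not preserved here; any two nine-word sentences that differ in two words behave the same):

```
we have the best soup in the entire city
we have the best soap in the whole city
```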
would give a similarity rating of 77%, because 7 out of 9 words matched (a sketch of one such word-level measure is given below, after these examples). But even then, it doesn't tell the whole story. Consider:
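(again illustrative, standing in for the lost original:)

```
the patch fixed the problem for every user
the patch never fixed the problem for every user
```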
A word-based edit distance might rate these as substantially the same, but semantically they are quite different. Conversely:
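(illustrative again:)

```
the car is quick
the automobile is fast
```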
might get rated less favourably, though semantically they are much closer.

You really need to have a much clearer idea of what it is you are trying to do. Or, if you have it clear in your own head, you need to express it much more clearly.
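Here is a sketch of one way to realise the word-level measure described above (illustrative only; it uses simple positional word overlap, whereas a real solution might use a true word-level Levenshtein distance):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rate two sentences by the fraction of words they share,
# position for position.
sub word_similarity {
    my ( $x, $y ) = @_;
    my @a = split ' ', lc $x;
    my @b = split ' ', lc $y;
    my $longer = @a > @b ? scalar @a : scalar @b;
    my $same   = 0;
    for my $i ( 0 .. $longer - 1 ) {
        ++$same
            if defined $a[$i]
            and defined $b[$i]
            and $a[$i] eq $b[$i];
    }
    return $same / $longer;
}

# The illustrative pair from above: prints 77.8% (7 of 9 words match).
printf "%.1f%%\n", 100 * word_similarity(
    'we have the best soup in the entire city',
    'we have the best soap in the whole city',
);
```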
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.