Re: Fuzzy text matching... again

kiz,
You are not just comparing two strings. You are mentally tokenizing them and giving the tokens context: text in parens at the end, for instance, does not seem to be important to you. In other words, you have to build a solution that applies the same mental process you do to decide whether two reference sources are the same. I have built such a solution in the past, but since I was hired to do it, I can't share the code with you.

There are plenty of tools for building your own, but you have to figure out how to glue them together. The problem is not simple, so your approach can't be either. I used a layered approach. Here are some things to consider.

Number of tokens in strings
Order of tokens in strings
Proximity of tokens (Archivio Marini vs Archivio Giuliano Marini vs Archivio Not Very Close At All Marini)
Remove unimportant tokens (trailing text in parens)
Normalize tokens
- Abbreviations
- Typos
- Alternative spellings (color and colour)
- Punctuation
When using an edit distance on individual tokens, consider the length of the tokens not just the distance
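
To make a few of these concrete, here is a minimal sketch (not the code I mentioned above, which I can't share). It is in Python only to keep it short, and every function name in it is made up for illustration. It assumes Latin-script input, drops one trailing parenthesised remark, strips accents and punctuation, and scales the edit distance by the length of the longer token.

```python
import re
import unicodedata

def tokenize(text):
    """Drop a trailing "(...)" remark, strip accents and punctuation, lowercase, split."""
    text = re.sub(r"\s*\([^()]*\)\s*$", "", text)            # trailing parens only
    text = unicodedata.normalize("NFKD", text).lower()
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.findall(r"[a-z0-9]+", text)                    # assumption: Latin script

def edit_distance(a, b):
    """Plain Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                      # delete from a
                           cur[j - 1] + 1,                   # insert into a
                           prev[j - 1] + (ca != cb)))        # substitute
        prev = cur
    return prev[-1]

def token_similarity(a, b):
    """1.0 for identical tokens; the distance is scaled by the longer token's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

With this, "mark" vs "mary" and "al" vs "at" both have an edit distance of 1, but the similarities are 0.75 and 0.5, which is the point about token length. Token order and proximity are not handled here and would need their own layer.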

Now consider all the tools in your tool bag and how they may be useful. Here are some examples:

Text::Soundex or Text::Metaphone to help identify alternative spellings
Regex::Assemble to help normalize words - especially ones with typos (continuously add to list as new variations are discovered)
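
The "continuously add to the list" idea can be sketched as follows. This is Python, not Regex::Assemble, and it only imitates the general idea of compiling a growing list of variants into one pattern. The variant table and the name build_normalizer are hypothetical; real entries would come from your own data.

```python
import re

# Hypothetical variant table: canonical form -> spellings seen so far.
# New variations get appended here as they are discovered.
VARIANTS = {
    "university": ["university", "univ", "universty", "uni"],
    "library": ["library", "libr", "lib", "libary"],
    "colour": ["colour", "color"],
}

def build_normalizer(variants):
    """Compile all known spellings into one pattern and return a rewriting function."""
    lookup = {v: canon for canon, spellings in variants.items() for v in spellings}
    # Longest spellings first, so the alternation prefers "university" over "uni".
    alternation = "|".join(re.escape(v) for v in sorted(lookup, key=len, reverse=True))
    # Whole words only; an optional trailing dot (as in "Univ.") is swallowed too.
    pattern = re.compile(r"\b(?:%s)\b\.?" % alternation, re.IGNORECASE)

    def normalize(text):
        return pattern.sub(lambda m: lookup[m.group(0).rstrip(".").lower()], text)

    return normalize

normalize = build_normalizer(VARIANTS)
print(normalize("Cracow Univ. of Technology Digital Libary"))
```

Note that the canonical forms come back in lower case, so run this before or together with whatever case folding you do.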

I can see you have already searched CPAN and know about things like Text::Compare and Text::PhraseDistance, but your strings look like publication references. There are a number of modules on CPAN for citations and bibliography references, and you may be able to leverage them as well. It would also help to know more about the overall project, because other tools may apply. For instance, do you have a known list of publications plus a list that needs to be identified, or do you have one huge bunch in which you are trying to find duplicates? The approach I would take is different in each case.

I have a stack full of notes on the topic of text comparison and analysis I have been meaning to write about at length. If you need more help, speak up.

Cheers - L~R

Re^2: Fuzzy text matching... again by almut (Canon) on Jan 07, 2010 at 16:08 UTC
"Remove unimportant tokens (trailing text in parens)"

You certainly have a number of good points. However, parenthesization need not imply subordination; it can also stand for an alternative name or term, as in the OP's case #5:

Cracow University of Technology Digital Library
Biblioteka Cyfrowa Politechniki Krakowskiej (Digital Library of Cracow University of Technology)

where the part in parentheses has a better chance of contributing to a successful match than the unparenthesized part. This is just to illustrate one of the many potential issues the OP might encounter.

And while we're at it: how would a machine identify what is a name and what is not, as in "Archivio Giuliano Marini", without consulting either a database of common names or a list of all known regular words (plus inflections) in a particular language? Even Google Translate apparently gets it wrong when translating "Archivio Giuliano Marini" into English (it leaves "Archivio" as is instead of translating it to "archive", even though you tell it what the source language is), while, interestingly, it gets it right with "Archivio Marini"...
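
On the parentheses point, one option is to keep the parenthesized text as an alternative name rather than discarding it, and score against both. The following is only a rough sketch in Python under that assumption. The function names are made up, and plain token overlap (Jaccard) stands in for whatever scoring the OP ends up using; it ignores token order entirely.

```python
import re

def name_variants(text):
    """Split "Main name (alternative name)" into both parts; otherwise return the text alone."""
    m = re.match(r"^(.*?)\s*\(([^()]*)\)\s*$", text)
    return [m.group(1), m.group(2)] if m else [text]

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    """Shared tokens divided by all tokens; order is ignored."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def best_match(a, b):
    """Best score over every pairing of main name and parenthesized alternative."""
    return max(jaccard(x, y) for x in name_variants(a) for y in name_variants(b))

english = "Cracow University of Technology Digital Library"
polish = ("Biblioteka Cyfrowa Politechniki Krakowskiej "
          "(Digital Library of Cracow University of Technology)")
print(best_match(english, polish))
```

For case #5 the parenthesized part contains exactly the same words as the English name, so the pair scores 1.0, whereas stripping the parentheses first would leave nothing in common.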
Re^3: Fuzzy text matching... again by Limbic~Region (Chancellor) on Jan 07, 2010 at 16:27 UTC
almut,
The determination of what is unimportant is something the OP will have to come up with. To be honest, until you pointed it out as a translation, I had no idea. Obviously nothing will be perfect, which is why I like the "prospector" method and continuous refinement. I know it isn't Bayesian filter heuristics, but I have been quite successful with it.

"And while we're at it: how would a machine identify what is a name and what not"

I can't remember mentioning names anywhere in my response. Name matching has a completely different kind of complexity (Mark and Mary only have an edit distance of 1; Richard, Rich, Dick and Dickie may all refer to the same person; cultural, ethnic and religious variations of a name; gender; variations in converting from native scripts to Latin-1; etc.). And yes, there are huge databases of names (common and otherwise) to deal with this. Check out the Global Name Recognition software owned by IBM for a very expensive solution to that problem.

Of course, names are not the only kind of string with its own complexity: mailing addresses, dates, identifying an anonymous author, identifying plagiarism, indexing, etc. I limited my advice to the problem as I understood it. I could obviously have missed the mark in thinking these were publication citations, which would make the "what is a name" question germane, but I don't see why a name can't just be treated like any other token. I obviously missed that the trailing parens are sometimes important as a translation, but I assume the OP will be capable of constructing an approach for removing unimportant tokens.

Cheers - L~R
Re^4: Fuzzy text matching... again by almut (Canon) on Jan 07, 2010 at 18:03 UTC
"I can't remember mentioning names anywhere in my response."

My last paragraph wasn't meant as a reply to anything you've said in particular, more as a "P.S." Sorry for not having made that clear.

"...these were publication citations making the "what is a name" germane but I don't see why it can't just be treated like any other token."

I just wanted to point out that if you understand what is what in "Archivio Giuliano Marini", you have a much better chance of telling whether something else, like "Archivio Giuliano Cassini", refers to the same thing or not. That is, knowing that "Archivio" simply means "archive" and that "Giuliano" is a common given name, you'd most likely figure out that they're two different archives, while a simple token comparison might identify them as being the same (because two of three tokens match, and the third one sounds similar).
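
To show what I mean, here is a small Python sketch that gives generic words and common given names less weight than the remaining tokens. The word lists and the weights 0.1 and 0.3 are invented for the example; in practice they would have to come from a dictionary or a name database, which is exactly the difficulty.

```python
import re

# Hypothetical lists and weights, for illustration only.
GENERIC_WORDS = {"archivio", "archive", "library", "biblioteka"}
COMMON_GIVEN_NAMES = {"giuliano", "mario", "maria"}

def weight(token):
    """Generic words count least, common given names a little more, the rest fully."""
    if token in GENERIC_WORDS:
        return 0.1
    if token in COMMON_GIVEN_NAMES:
        return 0.3
    return 1.0

def weighted_overlap(a, b):
    """Weight of the shared tokens divided by the weight of all tokens."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    total = sum(weight(t) for t in ta | tb)
    return sum(weight(t) for t in ta & tb) / total if total else 0.0

print(weighted_overlap("Archivio Giuliano Marini", "Archivio Giuliano Cassini"))
print(weighted_overlap("Archivio Giuliano Marini", "Archivio Marini"))
```

With these made-up weights the Marini/Cassini pair scores about 0.17, while "Archivio Marini" against "Archivio Giuliano Marini" scores about 0.79, even though both pairs share two tokens.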