Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Some kind of fuzzy logic.

by the_0ne (Pilgrim)
on Oct 17, 2003 at 19:56 UTC ( [id://300129]=perlquestion: print w/replies, xml ) Need Help??

the_0ne has asked for the wisdom of the Perl Monks concerning the following question:

I know that title is a little deceiving, but then again, after you see what I want to be able to do, it might not be all that deceiving...

I'm having a problem on what I thought to be a small project. I have a list of individual names and companies, altogether in one long list. There is nothing signifying if the text is a person's name or a company name. I have to compare it against another list and find possible conflicts. Well, anyone that has done this can imagine the possibilities, I'll just list a few here...
search for Aetna Insurance Company should match... Aetna Insurance Company Aetna Ins. Co. Aetna Insurance Co. Aetna Ins. Company Aetna Ins Company Aetna Insurance Co Aetna Ins Co
That's only a very minute example of other particulars I've come up with...
search for Sam Jones should bring up... Sam Jones Sam J. Jones Sam J Jones S. Jones Samuel Jones
This is why I used the title fuzzy logic, is there any kind of perl module out there that can do this? Even a bunch of modules that can give me various parts of this would be good. One in particular my boss brought up is Soundex, which is here. However I'm finding that soundex is really only good for misspellings like Smith and Smyth.

I can't imagine this is an easy question to answer, and I'm sure there's not going to be one all-encompassing module to do this. I'm just hoping for some pointers and maybe some modules that could do bits and pieces of this.

Thanks in advance Monks...

Replies are listed 'Best First'.
Re: Some kind of fuzzy logic.
by davido (Cardinal) on Oct 17, 2003 at 20:02 UTC
    There are several that may be helpful:

    Text::Abbrev - create an abbreviation list from a table

    Tie::Hash::Abbrev - a hash that can be accessed with abbreviated keys.

    Tie::Hash::Abbrev::Smart - same description as the last one.

    Tie::Hash::Approx - you get the idea...

    String::Approx - Perl extension for approximate matching (fuzzy matching).


    Dave


    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein
Re: Some kind of fuzzy logic.
by BrowserUk (Patriarch) on Oct 17, 2003 at 21:20 UTC

    This might serve as a starting point.

    #! perl -slw use strict; sub genRegex { my $query = shift; $query =~ tr[A-Za-z0-9 ][a-za-z0-9 ]; my @words = split ' ', $query; my $regex = join '\W+'.$/, map{ my $s=''; $s = "(?:$_$s)?" for reverse split ''; chop $s; "\\b$s\\b" } @words; return qr[$regex]xi; } my $regex; while( my $line = <DATA> ) { chomp $line; $regex = genRegex( $1 ) and print'' and next if $line =~ m[search for (.*) should match...]; print +($line =~ $regex ? 'Matched ' : 'Failed ' ), $line; } __DATA__ search for Aetna Insurance Company should match... Aetna Insurance Company Aetna Ins. Co. Aetna Insurance Co. Aetna Ins. Company Aetna Ins Company aetna insurance co Aetna Ins Co Atna Insurance Company Aetna Ins. Ca. Aetna Insurance Go. Aetna Ins. Compary search for Sam Jones should match... Sam Jones Sam J. Jones Sam J Jones S. Jones Samuel Janes Sam L Jones Sam J.J. Jones Sam J Jones B. Jones Manuel Jones

    Results:

    P:\test>junk Matched Aetna Insurance Company Matched Aetna Ins. Co. Matched Aetna Insurance Co. Matched Aetna Ins. Company Matched Aetna Ins Company Matched aetna insurance co Matched Aetna Ins Co Failed Atna Insurance Company Failed Aetna Ins. Ca. Failed Aetna Insurance Go. Failed Aetna Ins. Compary Matched Sam Jones Matched Sam J. Jones Matched Sam J Jones Matched S. Jones Failed Samuel Janes Failed Sam L Jones Matched Sam J.J. Jones Matched Sam J Jones Failed B. Jones Failed Manuel Jones

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!

      Thanks everybody for your help. I'll try out your example BrowserUK.
Re: Some kind of fuzzy logic.
by dragonchild (Archbishop) on Oct 17, 2003 at 20:17 UTC
    Basically, you're matching abbreviations. So, you need a 2-way mapping between full name and abbreviation. Once you have that, you can do a split, remove periods, and replace using your mapping.

    Is this a perfect solution? No. But, it'll get you 80% of the way there ...

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    ... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      Unfortunately getting the extra 20% is an exercise in diminishing returns.

      This is a very complex problem and you'll need to decide how much time (and/or money) you're prepared to spend getting as close to a complete solution as you deem necessary.

      There are some pretty impressive (and very expensive) commercial solutions dedicated to this problem.

      However you approach it, a machine will never get to 100% accuracy (at least not at this point).


      If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
      That way everyone learns.

Approximate matching of company names
by toma (Vicar) on Oct 18, 2003 at 05:56 UTC
    I have been able to make this work with String::Approx. I had to match one list of companies against another. The lists were made by different organizations.

    First, it helps to strip the noise from the company names, such as Inc, Co, Corp, GMBH, LTD, etc.

    String::Approx finds a distance by looking at insertions, deletions, and substitutions needed to transform one string to another.

    A different approach, which worked better for me, was to make lists of all the substrings of length n in the source string. I called these n-tuples. I compared the percentage overlap between the n-tuple sets for each name in one list to the n-tuples for each word in the other list. The best value for the length n of the tuples was three or four.

    Very close matches could be completely automated this way. For matches that were not so close, I finished the matching task manually. I made a web user interface that had a selection list of the match candidates ranked by the closeness of the match. The closeness was determined by the percentage of n-tuples that matched. I selected the best match for each entry on the amongst these top-ranked match candidates.

    It should work perfectly the first time! - toma
      A different approach, which worked better for me, was to make lists of all the substrings of length n in the source string. I called these n-tuples. I compared the percentage overlap between the n-tuple sets for each name in one list to the n-tuples for each word in the other list. The best value for the length n of the tuples was three or four.

      Toma, could you give me an example of what you mean by this paragraph? I don't want you to go to the trouble of code examples, I mean an example using text so I can better understand what you mean.

      Thanks...
        Here is the requested example that shows how n-tuples work:

        In this example, take n=3. The company names to compare are "tomacorp" and "tomarcorp". The 3-tuples of tomacorp are:

        tom oma mac aco cor orp
        The 3-tuples of tomarcorp are:
        tom oma mar arc rco cor orp
        The tuples in common are:
        tom oma cor orp
        Four of the six 3-tuples in tomacorp appear in tomarcorp. This is a 75% match.

        It should work perfectly the first time! - toma

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://300129]
Approved by BazB
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (4)
As of 2024-03-28 15:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found