gwhite has asked for the wisdom of the Perl Monks concerning the following question:

Warning: not Perl specific

I am writing a proposal system in Perl. The end user has a large database of projects and wants the 5 top projects that are "similar" to the one they are bidding on to be pulled from the database and added to the proposal in the appropriate section. There are about 10-12 different criteria to determine "similar". Some of the criteria should be an exact match, others will be within a range.

My question: (finally), are there any algorithms, modules, or such that could assist in developing a scoring system for the criteria. AI:Fuzzy seems to be the most promising but doesn't look like it has been worked on for a couple years. Certainly, any sharing of previous experience of similar works is welcome.

g_White

Replies are listed 'Best First'.
Re: Comparitive Scoring System
by mstone (Deacon) on Jan 02, 2002 at 22:17 UTC

    I don't know of any modules, but fuzzy logic includes a concept called 'hedging' that may work for you.

    Technically, a hedge modifies a fuzzy value. As a rough example, if 'old' means 'more than 50', the hedge 'very' would make 'very old' mean 'more than 80'. Individually, hedges provide a convenient way to model human terms like 'roughly', 'mostly', 'vaguely' and so on.

    Hedges tend to have the same effect on any modificand. The hedge 'very' does the same thing to 'very old' that it does to 'very tall', 'very close', 'very green', etc. That uniformity produces an interesting side-effect when you compare hedges to each other.

    The relative effect of two hedges tends to remain uniform regardless of modificand. 'Very' is always bigger than 'mostly', regardless of what you modify. That means you can build a set of hedges that form a reliable ordinal scale (one that defines an order, i.e.: this one comes before that one) for any set of terms.

    In other words, hedges let you mark off two endpoints of a line -- 'absolutely X' and 'absolutely not X' -- then define a reliable sequence of positions along that line. And once you have a reliable sequence, you can sort things.

    Now for the sake of vocabulary, let's call that line an axis. If we use more than one value for X, we create a space with multiple axes. We could make a three-dimensional space with axes for 'big', 'green', and 'loud', for instance. Then we can define objects with 3-dimensional points that represent their position in that space. A regular frog would sit at the low end of all three axes, while a 50' heavy-metal Irish frog would go to the high end of all three. The ocean would score high for 'big' and medium for 'green' and 'loud'.

    To select objects that are similar, we simply chop out a region whose center falls at a given point. If you want to be simple, you can make the region an N-dimensional cube.. anything within two ticks of point 'the sea' on any axis, for instance. If you want to be fancy, you can make it an N-dimensional sphere.

    Finally, you can hedge the modificand 'closeness' to determine the size of the region. 'very close' would generate a small region, while 'sorta close' would generate a larger one.

    Algorithmically, this reduces your problem to range evaluation, which is a known subject. Some databases (postgresql) even have geometric operators that will decide whether a given point falls inside a given region. The big job will be to choose a set of qualities to use as axes, then to assign a position in that space for every project in your table.

    Hope that at least gives you some ideas to kick around.

      Just remembered something else that might be useful..

      When it comes time to rate projects in terms of qualities X1..Xn, don't ask people to look at a single project and assign numbers for each axis. The numbers you get will vary from person to person, according to context. Most people would say that a 50' frog is big, for example. But those same people would also say that the ocean is big, and that the entire known universe is big. If you ask people rate the 'bigness' of those three items, the numbers you get will vary based on the order in which they're presented.

      OTOH, pretty much everyone will agree that the ocean is bigger than a 50' frog, but smaller than the known universe. In other words, people's perception of relative X-ness tends to be more reliable than their absolute rating.

      So instead of rating projects in isolation, give people two projects and ask them to compare them in terms of qualities X1..Xn. For each pair, you'll get a set of n more/same/less ratings that place the two projects relative to each other. Then once you collect enough ratings, you can distribute items along each axis by sorting them topologically.

      It is possible that the ratings you collect will be inconsistent, especially if you have more than one person ranking projects. One person may say P1 is bigger than P2, another may say P2 is bigger than P3, and a third may say P3 is bigger than P1. Topologically, that's a loop, and will screw you up if you try to convert ratings to absolute positions along the scale.

      You solve that problem by collecting more data, and making the relative values fuzzy. If 60% of people say P1 is bigger than P2, you say that 60% of P1 is bigger than P2, but 40% is smaller. If 30% of people say P2 is bigger than P3, P2 outs P3 by a 30/70 margin. If only 10% of people say that P3 is bigger than P1, it means that 10% of P3 is still bigger than some part of P1:

      P1 ---------- P2 ---------- P3 ----------

      Yes, that's a contrived example.. I'm doing text-graphs, here. ;-) The same idea applies for less typeable values, though.

      If you really want to obsess about it, you can generate a range and distribution function for each item along each axis, which is pretty much the definition of a fuzzy point. Then you can select a region from the space, and rate projects in terms of how much each one falls inside that region.

      Hope I haven't stepped over the line into hopeless incomprehensibility.

        If you really want to obsess about it, you can generate a range and distribution function for each item along each axis, which is pretty much the definition of a fuzzy point. Then you can select a region from the space, and rate projects in terms of how much each one falls inside that region.

        This is where I am at on it. I was hoping something would save me the pain of this process. I can't figure out how to do the region, other than map each zip code to a lat/long then do a circle distance from the new project point.

        Thanks for brain dump

        g_White
Re: Comparitive Scoring System
by mrbbking (Hermit) on Jan 02, 2002 at 20:56 UTC
    That depends on the critera you're using to determine similarity. Plain old sort may do the trick if the comparisons are lexical or numeric. Take the first five elements and call it a day.

    Could you add some examples of the things you'll be comparing? ...and what determines 'similar', if it's not obvious...

      Sort won't work

      This is for a interior design firm, so the similar criterias are, type (retail, office, law office, medical office, food court...50+ categories), sq footage, contract amount, region of country, office that did project, scope of work, construction (new, remodel, historical), client type

      Things like sq footage and contract amount don't have to be exact but within a range of say 10%+ or -. Region is concentric, city, state, group of states, half of country, country(and I don't immediately see how this could be done with AI::Fuzzy

      Many of the types are related so an accountants office scores more similar to a law office than does a doctors office. So there are very few black and white comparisons, I am trying to figure out how to do all these shades of gray without several huge comparison tables, but maybe that is what it will take.

      g_White