in reply to Help required in RE strategy

It is a tough case... I think I personally would take this approach.

Store the vendor strings as they are and either store as well with all non alphanum characters removed, or do it while you search (probably the former, if you do a lot of searching).

Match all your words, alphanum only, against the alphanum versions of the vendors... once you have a match, use the individual words in your string to match against the original string. This with the most matches rank highest.

Or maybe even go the other way around... store a word database and assuming you get a couple hits use that to shrink your search list for the alphanum reduced search. This would probably save you a lot of processing since you could use keying to reduce the search load.

Of course, I am sure there are other ways.

                - Ant
                - Some of my best work - (1 2 3)

Replies are listed 'Best First'.
Re^2: Help required in RE strategy
by lima1 (Curate) on Aug 22, 2007 at 15:01 UTC
    Such a word database could work as a filter as you said: examine only db entries that contain 2-3 words of the input, say:
    Compaq Presario Laptop Model PSLA 0L00U00E
    Then you could use expensive algorithms like http://en.wikipedia.org/wiki/Levenshtein_distance to find the closest match.

    You could also split the word database for a better filtering. For example require that one word matches in a vendor table (so that you examine only Compaq products).

    A nice thing of this algorithm is that you can start with a very restrictive filter and whenever you get not hits, you can reduce the minimal number of matching words.

      Yeah... but hopefully it won't degrade into fuzzy matching... it looks like matching the unmatched words with all the non-alphanum chars will get direct matches in the cases specified, and hopefully most others. That would be much cheaper and more reliable.

      One could take a further step and store the word boundaries and make sure that when you make a stripped match the beginning of your would started at a boundary and the end of your word ended at a boundary, regardless of how many boundaries there were in the middle. Harder, but even more reliable (though probably unnecessary in this case).

                      - Ant
                      - Some of my best work - (1 2 3)

        This is not as expensive as you probably think. Assume
        Compaq Presario Laptop Model No P440 L100 Series (PSLA0L-00U00E)
        ... as input ->
        00 440 100 Compaq Laptop Model No Presario Series PSLA
        Then we need todo something like this:
        ... # get vendor id: SELECT id FROM vendor WHERE NAME=00 or NAME=440 ... # check all words foreach (@word) { SELECT product_id FROM words WHERE vendor_id=? and word=? for my $id (@ids) $cnt{$id}++ } }
        How you then analyze this match statistic depends on the data. But the editdistance is probably the wrong metric, you are right.