in reply to Re: Help required in RE strategy
in thread Help required in RE strategy

Such a word database could work as a filter as you said: examine only db entries that contain 2-3 words of the input, say:
Compaq Presario Laptop Model PSLA 0L00U00E
Then you could use expensive algorithms like http://en.wikipedia.org/wiki/Levenshtein_distance to find the closest match.

You could also split the word database for a better filtering. For example require that one word matches in a vendor table (so that you examine only Compaq products).

A nice thing of this algorithm is that you can start with a very restrictive filter and whenever you get not hits, you can reduce the minimal number of matching words.

Replies are listed 'Best First'.
Re^3: Help required in RE strategy
by suaveant (Parson) on Aug 22, 2007 at 15:08 UTC
    Yeah... but hopefully it won't degrade into fuzzy matching... it looks like matching the unmatched words with all the non-alphanum chars will get direct matches in the cases specified, and hopefully most others. That would be much cheaper and more reliable.

    One could take a further step and store the word boundaries and make sure that when you make a stripped match the beginning of your would started at a boundary and the end of your word ended at a boundary, regardless of how many boundaries there were in the middle. Harder, but even more reliable (though probably unnecessary in this case).

                    - Ant
                    - Some of my best work - (1 2 3)

      This is not as expensive as you probably think. Assume
      Compaq Presario Laptop Model No P440 L100 Series (PSLA0L-00U00E)
      ... as input ->
      00 440 100 Compaq Laptop Model No Presario Series PSLA
      Then we need todo something like this:
      ... # get vendor id: SELECT id FROM vendor WHERE NAME=00 or NAME=440 ... # check all words foreach (@word) { SELECT product_id FROM words WHERE vendor_id=? and word=? for my $id (@ids) $cnt{$id}++ } }
      How you then analyze this match statistic depends on the data. But the editdistance is probably the wrong metric, you are right.
        Don't you want that first query to be an AND... assuming you can be relatively sure of all terms matching.

                        - Ant
                        - Some of my best work - (1 2 3)