Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Pundits,
We get feed from our merchants and we will have to match their product name with the names maintained in our DB. The problem is the data that we get from the merchants don't follow a given pattern. This poses a major issue.

For example, name of one product in our DB is
Compaq Presario P440 L100 PSLA0L00U00E.

The feed from merchants are given below:
Compaq Presario Laptop Model No P440 L100 Series (PSLA0L-00U00E)
Compaq Presario Laptop P440 L100 Series PSLA0L 00U00E
Compaq Presario Laptop Model PSLA 0L00U00E
Compaq Presario 440 Laptop Model No-PSLA 0L00U00

As you can see the feed contains different names for the same product model. And this involves lot of manual work for us. So we decided to maintain our DB differently i.e.
a. each word will be maintained separately.
b. And if a word contains both number and character then character will be separated from number and both will be maintained separately. So the product given in the example, Compaq Presario P440 L100 PSLA0L00U00E, will be maintained as

1. Compaq
2. Presario
3. P
4. 440
5. L
6. 100
7. PSLA
8. 0
9. L
10. 00
11. U
12. 00
13. E

As you can see when we search for the 3rd word which is 'P', against the name provided by the merchant, it gets matched with the p in the 1st word 'Compaq' and so on. Require your help in solving this problem.

Thanks

Replies are listed 'Best First'.
Re: Help required in RE strategy
by suaveant (Parson) on Aug 22, 2007 at 14:41 UTC
    It is a tough case... I think I personally would take this approach.

    Store the vendor strings as they are and either store as well with all non alphanum characters removed, or do it while you search (probably the former, if you do a lot of searching).

    Match all your words, alphanum only, against the alphanum versions of the vendors... once you have a match, use the individual words in your string to match against the original string. This with the most matches rank highest.

    Or maybe even go the other way around... store a word database and assuming you get a couple hits use that to shrink your search list for the alphanum reduced search. This would probably save you a lot of processing since you could use keying to reduce the search load.

    Of course, I am sure there are other ways.

                    - Ant
                    - Some of my best work - (1 2 3)

      Such a word database could work as a filter as you said: examine only db entries that contain 2-3 words of the input, say:
      Compaq Presario Laptop Model PSLA 0L00U00E
      Then you could use expensive algorithms like http://en.wikipedia.org/wiki/Levenshtein_distance to find the closest match.

      You could also split the word database for a better filtering. For example require that one word matches in a vendor table (so that you examine only Compaq products).

      A nice thing of this algorithm is that you can start with a very restrictive filter and whenever you get not hits, you can reduce the minimal number of matching words.

        Yeah... but hopefully it won't degrade into fuzzy matching... it looks like matching the unmatched words with all the non-alphanum chars will get direct matches in the cases specified, and hopefully most others. That would be much cheaper and more reliable.

        One could take a further step and store the word boundaries and make sure that when you make a stripped match the beginning of your would started at a boundary and the end of your word ended at a boundary, regardless of how many boundaries there were in the middle. Harder, but even more reliable (though probably unnecessary in this case).

                        - Ant
                        - Some of my best work - (1 2 3)

Re: Help required in RE strategy
by graff (Chancellor) on Aug 23, 2007 at 02:30 UTC
    It seems like the merchants are mostly behaving pretty well in regards to the "PSLA0L00U00E" part of your identifier -- 3 out of 4 match it exactly when you ignore spurious "punctuation", and the odd-ball simply forgot the final "E" (maybe he assumed the final "E" wasn't important, and in case it is, you should be entitled to push his feed back at him until he gets it right ;).

    In any case, I'd spend some labor to review your own database entries and see whether they all have something like the "PSLA0L00U00E" sort of substring, and make sure to have that in its own field. Then I'd go through a list of merchant feeds like this:

    # suppose %product_id has the "PSLA0L00U00E"-like strings from the DB # as hash keys, and something useful (row_ids?) as hash values... open( REJ, ">", "merchant_data.to_review" ) or die $!; while (<DATA>) { # reading merchant stuff line-by-line my $orig = $_; s/\W+//g; for my $model ( keys %product_id ) { if ( index( $_, $model ) >=0 ) { handle_a_match( $orig, $model ); $orig = ''; last; } # if the last character of $model is "optional", then # use this else block: else { chop ( my $modl = $model ); if ( index( $_, $modl ) >= 0 ) { hanle_a_match( $orig, $model ); $orig = ''; last; } } } print REJ $orig unless ( $orig eq '' ); }

    I wouldn't be surprised if you had to try other "adaptive" matches besides removing punctuation and possibly removing a final letter -- e.g. to handle crap like upper-case letter o vs. digit zero, lower-case letter L vs. digit one, etc, which would require a more elaborate regex match. For instance, instead of using index() in the "optional" else block above, you could build a regex there like this:

    else { my $modelrgx = $model . "?"; # make last character "option +al" $modelrgx =~ s/[1Il]/[1Il]/g; $modelrgx =~ s/[0O]/[0O]/g; if ( /$modelrgx/ ) { handle_a_match( $orig, $model ); $orig = ''; last; } }
Re: Help required in RE strategy
by wfsp (Abbot) on Aug 22, 2007 at 15:54 UTC
    The product name in your db looks as though you may have rules on how to construct it. You don't have 'laptop', 'model', 'no', parens or dashes in it. Are all the names constructed in the same way? Perhaps filter them out first.

    Is the number at the end of the record unique? Do you have rules for constructing those? Perhaps concentrate on that.

    How many merchants are there? Are their naming conventions consistant? If they ordered the same m/c two days running would it be the same?

    I'd consider a separate table with the (filtered) alternatives pointing to your main db and flag up non matches. It would still mean there would be manual work to do but it would tail off.

    As suaveant notes normalising data is a pain.

    Good luck!

fix the data?? (Re: Help required in RE strategy)
by Anonymous Monk on Aug 22, 2007 at 17:04 UTC
    How about you have the feed contain a model number identifier and have the partner also supply a mapping file that associates the N different descriptions of the item with the one true model number. seems you'll get better mileage fixing the data instead of trying to hit a constantly moving target.
Re: Help required in RE strategy
by Anonymous Monk on Aug 23, 2007 at 13:15 UTC
    Appreciate your help. Thanks for all your suggestions.