It seems the merchants are mostly behaving well with regard to the "PSLA0L00U00E" part of your identifier: 3 out of 4 match it exactly once you ignore spurious "punctuation", and the oddball simply dropped the final "E" (maybe he assumed the final "E" wasn't important; if it is, you should be entitled to push his feed back at him until he gets it right ;).
In any case, I'd spend some effort reviewing your own database entries to see whether they all contain a "PSLA0L00U00E"-style substring, and make sure that substring is stored in its own field. Then I'd go through the merchant feeds like this:
# suppose %product_id has the "PSLA0L00U00E"-like strings from the DB
# as hash keys, and something useful (row_ids?) as hash values...
open( my $rej, ">", "merchant_data.to_review" ) or die $!;
while (<DATA>) {    # reading merchant stuff line-by-line
    my $orig = $_;
    s/\W+//g;       # strip punctuation and whitespace before matching
    for my $model ( keys %product_id ) {
        if ( index( $_, $model ) >= 0 ) {
            handle_a_match( $orig, $model );
            $orig = '';
            last;
        }
        # if the last character of $model is "optional", then
        # use this else block:
        else {
            chop( my $modl = $model );
            if ( index( $_, $modl ) >= 0 ) {
                handle_a_match( $orig, $model );
                $orig = '';
                last;
            }
        }
    }
    # anything that matched no model goes to the review file
    print $rej $orig unless ( $orig eq '' );
}
I wouldn't be surprised if you had to try other "adaptive" matches besides removing punctuation and possibly dropping a final letter, e.g. to handle crap like upper-case letter O vs. digit zero or lower-case letter L vs. digit one, which would require a regex match rather than index(). For instance, instead of using index() in the "optional" else block above, you could build a regex there like this:
        else {
            my $modelrgx = $model . "?";    # make last character "optional"
            $modelrgx =~ s/[1Il]/[1Il]/g;
            $modelrgx =~ s/[0O]/[0O]/g;
            if ( /$modelrgx/ ) {
                handle_a_match( $orig, $model );
                $orig = '';
                last;
            }
        }
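If you want to try the look-alike matching on its own before wiring it into the loop, here is a minimal standalone sketch. The helper names tolerant_regex() and char_class() and the three sample merchant strings are made up for illustration; only the "PSLA0L00U00E" model comes from your data. It builds the pattern one character at a time and runs ordinary characters through quotemeta, on the assumption that some model strings in the DB might contain regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper: map one character of the model to a pattern fragment.
sub char_class {
    my ($c) = @_;
    return '[1Il]' if $c =~ /^[1Il]$/;    # digit one, upper-case i, lower-case L
    return '[0O]'  if $c =~ /^[0O]$/;     # digit zero, upper-case o
    return quotemeta $c;                  # everything else matches literally
}

# Hypothetical helper: build a regex that tolerates look-alike characters
# and treats the final character of the model as optional.
sub tolerant_regex {
    my ($model) = @_;
    die "empty model string\n" unless length $model;
    my @parts = map { char_class($_) } split //, $model;
    $parts[-1] .= '?';                    # last character becomes optional
    my $pattern = join '', @parts;
    return qr/$pattern/;
}

my $rgx = tolerant_regex('PSLA0L00U00E');

# Made-up merchant strings, cleaned the same way as in the loop above.
for my $line ( 'PSLA0L-00U00E', 'psla: PSLAOL00U00', 'PSLB0L00U00E' ) {
    ( my $clean = $line ) =~ s/\W+//g;
    printf "%-20s %s\n", $line, ( $clean =~ $rgx ? 'match' : 'no match' );
}
```

With these samples the first two lines match (the second has a letter O for a zero and no final "E") and the third does not. Compiling the pattern once per model with qr// would also let you cache the regexes in a hash keyed by model instead of rebuilding them for every merchant line.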