in reply to Help required in RE strategy
In any case, I'd spend some labor to review your own database entries and see whether they all have something like the "PSLA0L00U00E" sort of substring, and make sure to have that in its own field. Then I'd go through a list of merchant feeds like this:
    # suppose %product_id has the "PSLA0L00U00E"-like strings from the DB
    # as hash keys, and something useful (row_ids?) as hash values...

    open( REJ, ">", "merchant_data.to_review" ) or die $!;
    while (<DATA>) {    # reading merchant stuff line-by-line
        my $orig = $_;
        s/\W+//g;       # strip punctuation and whitespace before comparing
        for my $model ( keys %product_id ) {
            if ( index( $_, $model ) >= 0 ) {
                handle_a_match( $orig, $model );
                $orig = '';
                last;
            }
            # if the last character of $model is "optional", then
            # use this else block:
            else {
                chop ( my $modl = $model );
                if ( index( $_, $modl ) >= 0 ) {
                    handle_a_match( $orig, $model );
                    $orig = '';
                    last;
                }
            }
        }
        print REJ $orig unless ( $orig eq '' );
    }
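If you want to try that loop without a real database behind it, here is a minimal self-contained sketch. The second hash key, the row ids, the sample merchant lines and the helper name find_model are all made up for illustration; sorting the keys is only there to make the result repeatable.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the DB: model string => row id
my %product_id = ( PSLA0L00U00E => 101, QTXB1M22K => 102 );

# Return the first model found in a merchant line, or undef if none is found.
sub find_model {
    my ($line) = @_;
    ( my $squeezed = $line ) =~ s/\W+//g;    # drop punctuation and spaces
    for my $model ( sort keys %product_id ) {
        return $model if index( $squeezed, $model ) >= 0;
        # retry without the final, possibly optional, character
        my $short = substr( $model, 0, -1 );
        return $model if index( $squeezed, $short ) >= 0;
    }
    return;
}

for my $line ( "Acme PSLA0L-00U00E laptop", "PSLA0L 00U00 refurb", "no model here" ) {
    my $model = find_model($line);
    print defined $model ? "$line => $model\n" : "$line => (to review)\n";
}
```

The first two sample lines should resolve to the same key (one after punctuation is stripped, one through the shortened form), and the third should fall through to the review pile.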
I wouldn't be surprised if you had to try other "adaptive" matches besides removing punctuation and possibly removing a final letter -- e.g. to handle crap like upper-case letter o vs. digit zero, lower-case letter L vs. digit one, etc, which would require a more elaborate regex match. For instance, instead of using index() in the "optional" else block above, you could build a regex there like this:
    else {
        my $modelrgx = $model . "?";   # make last character "optional"
        $modelrgx =~ s/[1Il]/[1Il]/g;
        $modelrgx =~ s/[0O]/[0O]/g;
        if ( /$modelrgx/ ) {
            handle_a_match( $orig, $model );
            $orig = '';
            last;
        }
    }
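As a standalone sketch of that regex idea, something like the following could be used to check the pattern against a few strings before wiring it into the loop. The helper name model_regex and the test strings are invented for the example, and the quotemeta call is an extra precaution (not in the snippet above) in case a model string ever contains a regex metacharacter.

```perl
use strict;
use warnings;

# Build a pattern from a model string: look-alike characters match each
# other, and the final character is optional.
sub model_regex {
    my ($model) = @_;
    my $pat = join '', map {
          /[1Il]/ ? '[1Il]'          # digit one, capital I, lower-case L
        : /[0O]/  ? '[0O]'           # digit zero, capital O
        :           quotemeta($_)    # anything else matches literally
    } split //, $model;
    return qr/$pat?/;                # trailing "?" applies to the last character
}

my $rgx = model_regex("PSLA0L00U00E");
for my $text ( "PSLAOL00U00E", "PSLA0L0OU00", "PSLB0L00U00E" ) {
    print "$text: ", ( $text =~ $rgx ? "match" : "no match" ), "\n";
}
```

Building the pattern once per model (rather than on every merchant line) would also save some work if the feed is large.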