It seems the merchants are mostly behaving well with regard to the "PSLA0L00U00E" part of your identifier: 3 out of 4 match it exactly once you ignore spurious "punctuation", and the oddball simply dropped the final "E". (Maybe he assumed the final "E" wasn't important; if it is, you should be entitled to push his feed back at him until he gets it right. ;)

In any case, I'd put some effort into reviewing your own database entries to see whether they all contain a "PSLA0L00U00E"-style substring, and make sure that substring is stored in its own field. Then I'd go through the merchant feeds like this:

# suppose %product_id has the "PSLA0L00U00E"-like strings from the DB
# as hash keys, and something useful (row_ids?) as hash values...

open( REJ, ">", "merchant_data.to_review" ) or die $!;

while (<DATA>) {  # reading merchant stuff line-by-line
    my $orig = $_;
    s/\W+//g;
    for my $model ( keys %product_id ) {
        if ( index( $_, $model ) >=0 ) {
            handle_a_match( $orig, $model );
            $orig = '';
            last;
        }

# if the last character of $model is "optional", then
# use this else block:
        else {
            chop ( my $modl = $model );
            if ( index( $_, $modl ) >= 0 ) {
                handle_a_match( $orig, $model );
                $orig = '';
                last;
            }
        }
    }
    print REJ $orig unless ( $orig eq '' );
}
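
For anyone who wants to try the idea before wiring it to a real database, here is a minimal self-contained sketch of the same loop. The contents of %product_id, the sample feed lines, and the body of handle_a_match() are all made up for illustration; the real feeds and the real handler will certainly differ.

```perl
use strict;
use warnings;

# Hypothetical DB contents: model string => row id.
my %product_id = ( PSLA0L00U00E => 101 );

# Hypothetical merchant lines; real feeds will look different.
my @feed = (
    "Toshiba PSLA0L-00U00E notebook\n",
    "Toshiba PSLA0L 00U00 notebook\n",    # final "E" missing
    "Some unrelated gadget XYZ-123\n",
);

my ( @matched, @rejected );

# Stand-in for the real handler: just record the pairing.
sub handle_a_match {
    my ( $line, $model ) = @_;
    push @matched, [ $line, $model ];
}

for my $orig (@feed) {
    ( my $squeezed = $orig ) =~ s/\W+//g;    # drop punctuation and whitespace
    my $found = 0;
    for my $model ( keys %product_id ) {
        my $short = substr( $model, 0, -1 );    # model minus its last character
        if (   index( $squeezed, $model ) >= 0
            or index( $squeezed, $short ) >= 0 )
        {
            handle_a_match( $orig, $model );
            $found = 1;
            last;
        }
    }
    push @rejected, $orig unless $found;    # keep for manual review
}

printf "%d matched, %d to review\n", scalar @matched, scalar @rejected;
```

Collecting the rejects in an array instead of a file is only to keep the sketch easy to run; writing them to "merchant_data.to_review" as above works just as well.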

I wouldn't be surprised if you had to try other "adaptive" matches besides removing punctuation and possibly removing a final letter, e.g. to handle crap like upper-case letter O vs. digit zero, or lower-case letter L (and upper-case I) vs. digit one, which would require a more elaborate regex match. For instance, instead of using index() in the "optional" else block above, you could build a regex there like this:

        else {
            my $modelrgx = $model . "?"; # make last character "optional"
            $modelrgx =~ s/[1Il]/[1Il]/g;
            $modelrgx =~ s/[0O]/[0O]/g;
            if ( /$modelrgx/ ) {
                handle_a_match( $orig, $model );
                $orig = '';
                last;
            }
        }
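
To see what such a pattern accepts, here is a small standalone sketch. The helper name tolerant_regex() and the test strings are my own inventions for illustration; it builds the pattern one character at a time rather than with s///, but the result should be equivalent for plain alphanumeric model strings.

```perl
use strict;
use warnings;

# Build a tolerant pattern from a model string: look-alike characters
# are interchangeable and the last character is optional.
sub tolerant_regex {
    my ($model) = @_;
    my @parts;
    for my $c ( split //, $model ) {
        if    ( $c =~ /[1Il]/ ) { push @parts, '[1Il]' }
        elsif ( $c =~ /[0O]/ )  { push @parts, '[0O]' }
        else                    { push @parts, quotemeta $c }
    }
    $parts[-1] .= '?' if @parts;    # make the last character optional
    my $pattern = join '', @parts;
    return qr/$pattern/;
}

my $rgx = tolerant_regex('PSLA0L00U00E');

# Hypothetical merchant strings, already stripped of punctuation.
for my $text ( 'PSLAOL00U00E', 'PSLA0L0OUO0', 'PSLB0L00U00E' ) {
    printf "%-14s %s\n", $text, ( $text =~ $rgx ? 'match' : 'no match' );
}
```

The first two strings swap letter O for zero (and the second also lacks the final "E"), so they should match; the third has a genuinely different character and should not. Keep in mind that the looser the pattern, the more likely two different models will collide, so it is worth checking your own model list for near-duplicates first.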

In reply to Re: Help required in RE strategy by graff
in thread Help required in RE strategy by Anonymous Monk
