
I didn't see in your problem explanation a description of how many "standardized responses" there could be. Are we talking about thousands? Hundreds? Tens?

It would also be useful to know whether the incomplete versions of the standardized responses are at least predictable, and unique. I understand that the entire response text might differ from transaction to transaction, but does a "100 - Bad Transaction" message always get abbreviated as "Bad T" before being embedded in the response text, and is the abbreviation unique so that no two standardized response codes could have the same abbreviation?

Let's say you've got a total of 100 possible standardized responses / codes. Start by building a cross-reference table that maps each abbreviation to its full-sized version:

my %xref = ( 
    'abbrev1' => '100 - Non-Abbreviated 1',
    'abbrev2' => '200 - Non-Abbreviated 2',
    # ... one entry per standardized response
);

Next build up a big regex full of alternations:

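# Escape each key, and try longer keys first so that a short abbreviation
# cannot shadow a longer one that starts with the same text.
my $alternations = join '|', map { quotemeta } sort { length($b) <=> length($a) } keys %xref;
my $regexp = qr/\b($alternations)\b/;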

Next, scan your response text and look up the crossref:

my( $abbrev ) = $raw_response =~ m/$regexp/;
my $std_response;
if( defined $abbrev && exists $xref{$abbrev} ) {
    $std_response = $xref{$abbrev};
}
else {
    # Report the text that was scanned; $std_response is still undefined here.
    die "No valid response found in <<$raw_response>>";
}
print "$std_response\n";
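
Putting the three steps together, a minimal self-contained sketch might look like the following. The table entries, the sample response text, and the sub name standardize are all made up for illustration; your real abbreviations will differ.

```perl
use strict;
use warnings;

# Hypothetical cross-reference table (abbreviation => full response).
my %xref = (
    'Bad T'  => '100 - Bad Transaction',
    'Insuff' => '200 - Insufficient Funds',
);

# Longest keys first, each escaped so it is matched literally.
my $alternations = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) || $a cmp $b }
    keys %xref;
my $regexp = qr/\b($alternations)\b/;

# Return the full standardized response, or undef if none is recognized.
sub standardize {
    my ($raw_response) = @_;
    my ($abbrev) = $raw_response =~ $regexp;
    return defined $abbrev ? $xref{$abbrev} : undef;
}

# Hypothetical raw response text.
print standardize('Txn 4711 declined: Bad T (retry later)'), "\n";
```

Returning undef instead of dying leaves it to the caller to decide what an unrecognized response should mean.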

Perl's regular expression engine (as of 5.10, if I recall) performs "trie optimization" on alternations of literal strings, so matching against even a long list of alternatives should be very fast. Hash keys are always plain strings, so they cannot be Regexp objects, but they can hold the text that you will use as components of a regexp pattern.
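
If the abbreviations turn out to vary a little, the keys could hold pattern source text rather than literal strings. This is a different technique from the single big alternation above: each key is compiled once and tried in turn. A rough sketch, with made-up patterns and a hypothetical sub name standardize_by_pattern:

```perl
use strict;
use warnings;

# Hypothetical table: each key is pattern source text, not a literal string.
my %pattern_xref = (
    'Bad\s+T\w*' => '100 - Bad Transaction',
    'Insuff\w*'  => '200 - Insufficient Funds',
);

# Compile every key once and keep it next to its full response.
my @compiled = map { [ qr/\b$_\b/, $pattern_xref{$_} ] } sort keys %pattern_xref;

# Return the full response for the first pattern that matches, or undef.
sub standardize_by_pattern {
    my ($raw_response) = @_;
    for my $pair (@compiled) {
        my ($re, $full) = @$pair;
        return $full if $raw_response =~ $re;
    }
    return undef;
}
```

Because these keys are interpolated as patterns, they must not be passed through quotemeta, and the loop gives up the speed of the single literal alternation.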

It's possible that this approach won't work for you if the possible abbreviations aren't unique, or if one abbreviation could be truncated in such a way as to produce another valid abbreviation. It also won't work if you can't count on abbreviations being predictable. If those sorts of issues exist, you might have to explain to us how you as a human would look at the response text and visually/mentally detect a standardized response abbreviation. Then the problem would be to try to turn that process into a set of rules that could be implemented programmatically.
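
One way to find out whether truncation is a real risk is to check the abbreviation list for entries that are prefixes of other entries. A small sketch, assuming the abbreviations are available as a plain list (the sub name prefix_conflicts and the sample data are hypothetical):

```perl
use strict;
use warnings;

# Return [shorter, longer] pairs where one abbreviation is a proper
# prefix of another. After a plain string sort, a prefix always comes
# before anything that extends it, so only pairs with $i < $j matter.
sub prefix_conflicts {
    my @abbrevs = sort @_;
    my @conflicts;
    for my $i (0 .. $#abbrevs) {
        for my $j ($i + 1 .. $#abbrevs) {
            my ($short, $long) = @abbrevs[$i, $j];
            push @conflicts, [$short, $long]
                if length($short) < length($long)
                && index($long, $short) == 0;
        }
    }
    return @conflicts;
}

# Hypothetical abbreviations, one of which is a truncation of another.
my @found = prefix_conflicts('Bad T', 'Bad Tr', 'Insuff');
print "'$_->[0]' is a prefix of '$_->[1]'\n" for @found;
```

An empty result means no abbreviation can be cut short into another one, at least as far as prefixes go.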

Dave

In reply to Re: Calling all REGEX Gurus - nasty problem involving regular expressions combined with hash keys - I need ideas as to how to even approach the problem by davido
in thread Calling all REGEX Gurus - nasty problem involving regular expressions combined with hash keys - I need ideas as to how to even approach the problem by ted.byers
