I am at a loss as to how to even begin this problem.

In a transaction processing system, in most cases the server sends a response code and a response text (giving the meaning of the response code). Those are trivially easy to work with. In fact, that data is good enough that I could even establish a hash containing the response text value as a key and the response code as a value, and use that with the response text provided for a given transaction and accurately get the code.

Of course, when they send both, there is no need. However, I have to use one service that does not send the standardized code, and the standard response text is provided as part of a much longer 'response text', and at that, the complete 'standardized response' is not sent, only an abbreviated form of it. Therefore, what I have to do is compare the response text they return and use a regular expression to see which key in a hash, that maps response text to response code, best fits it. In other words, once I create the hash mapping response text to response code, I have to determine, as quickly as possible, which key contains the longest substring that is contained in the response text that this service has provided.

Alas, I do not control this particular server, and suggestions to improve it take a long time to be acted upon, if they choose to act at all. I could 'ask' that they send the proper response code as a separate field in their response, but I might not live long enough to see that happen.

Now, I know I could implement a brute force algorithm, using a statistical similarity measure drawn from information theory, but I am hoping there is a guru out there who knows how best to handle this sort of problem in an efficient manner. I have to ensure that whatever other processing the script in question has to do, that the response time is less than a second (the faster the better). I had more than half a dozen algorithms to trim whitespace from the start and end of an arbitrary string, that all worked flawlessly using one regex or another, and yet there was an order of magnitude difference in how quickly they did the job. Thus, it would be good to know several options that would all work correctly, and I'd then write some benchmarking code to facilitate comparing their speed.

Any suggestions as to fast algorithms to solve this problem would be greatly appreciated.

Thanks

Ted


In reply to Calling all REGEX Gurus - nasty problem involving regular expressions combined with hash keys - I need ideas as to how to even approach the problem by ted.byers

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.