Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re: Most specific pattern

by graff (Chancellor)
on Jul 03, 2005 at 07:04 UTC ( [id://471994]=note: print w/replies, xml ) Need Help??


in reply to Most specific pattern

This process basically has a config file that has a bunch of regexes, and for each regex, an action. Items are passed to the process, and it determines the "best fit" and executes that action associated with it. However, in reading the documentation, it seems that the metric that they use for "best fit" is the pattern with the fewest wildcard characters.

Are you sure the documentation is a correct description of what the code does? (Are you sure you are interpreting it correctly?)

When a given input is able to match two or more distinct regexes in a set (and the chosen action differs for each regex), then there are two ways to choose among the possible paths:

  • If the regexes are applied in a specific order and the first one to match determines the action, then there's a simple and unavoidable rule, similiar to the idea in tphyahoo's reply: the regexes need to be ordered from longest specific match to shortest. Otherwise, the longer, more specific regexes would never get a chance to apply.

  • If all regexes are tested before any action is taken, and their successes are scored in some way before deciding which action to take, this would provide some flexibility in defining how the scoring works (you could even introduce probabilities, if you wanted results that were non-deterministic but could still correlate in some way with the inputs).

    But for most systems (where people want the results to be deterministic), it's more likely that the best way to score would be based on the length of the match, or rather, the difference between the length of the input and the length of the match. If two regexes yield the same minimum length difference between input and match, then the next criterion might be some sort of ranking among the regexes in terms of their relative literalness or specificity -- e.g. the one using more literals, smaller character classes and/or smaller quantity qualifiers would rank higher. But it's hard to conceive of a programmatic approach to ranking regexes this way.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://471994]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (3)
As of 2024-04-18 04:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found