Re^2: Most specific pattern

I agree with this... however, the devil's in the details. Given an arbitrary regex, how does one build a metric around how many non-wildcarded characters it contains? Is there a module that takes care of this, or at least one that could be bent to fit?

thor

Feel the white light, the light within Be your own disciple, fan the sparks of will For all of us waiting, your kingdom will come

Replies are listed 'Best First'.

Re^3: Most specific pattern
by jhourcle (Prior) on Jul 02, 2005 at 15:04 UTC

That was just one of the metrics that I could think of ... and I'd only reach for one if the number of regexes were so large that you couldn't rank them yourself (going on the assumption that I know more about the process than a regex does).

If I had to go completely on just odds of matching, I would think it'd be easiest to take a representative sample of inputs, and test them against each of the regexes, and build a table with the odds.
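To make the sampling idea concrete, here is a rough sketch in Python (its `re` module is close to Perl's regex syntax but not identical, so treat it as an illustration only). The function names `match_rates` and `rank_most_specific` are made up for this example, and the sample inputs are invented.

```python
import re

def match_rates(patterns, samples):
    """Return {pattern: fraction of the sample inputs that the pattern matches}."""
    if not samples:
        raise ValueError("need at least one sample input")
    rates = {}
    for pattern in patterns:
        rx = re.compile(pattern)
        hits = sum(1 for s in samples if rx.search(s))
        rates[pattern] = hits / len(samples)
    return rates

def rank_most_specific(patterns, samples):
    """Order patterns from lowest match rate (most specific) to highest."""
    rates = match_rates(patterns, samples)
    return sorted(patterns, key=lambda p: rates[p])

# Invented sample data, purely for illustration.
samples = ["123", "abc", "4567", "x9"]
patterns = [r"\d+", r"^\d{3}$", r"."]
print(match_rates(patterns, samples))
print(rank_most_specific(patterns, samples))
```

The table is only as good as the sample: a pattern that never matches anything in the log will look maximally specific, so the inputs really do need to be representative.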

If you don't have a log of those inputs for testing, then we'd have to get more creative ... I might use something like the following --

Any character or zero-width assertion gets 1 point (unless the assertion is pointless, like '\W\b\w').
Any character class of n characters gets f(n) points, where f(n) yields a number less than one and decreases as n increases (maybe 1/n, or sqrt(1/n)).
Quantifiers reduce the value of the items they modify ... perhaps as multipliers ... ( ? = 0.5; + = 0.6; * = 0.25; +? = 0.7; *? = 0.35 ) (I'm just pulling numbers out of the air ... you'd want to tweak the numbers 'till you get good results for your situation).
Alternations provide something less than the points value of each of their possibilities. (I have no clue on a formula for this one...)
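The first three rules can be sketched roughly as follows, again in Python and only for a deliberately tiny subset of regex syntax: literal characters, backslash escapes (each counted as one point, which is a simplification), character classes written out without ranges or negation, and the five quantifiers listed above. Giving `.` zero points and using f(n) = 1/n are assumptions made for this example, the name `specificity` is hypothetical, and groups and alternation are not handled at all.

```python
# Multipliers copied from the list above; they are admittedly arbitrary.
QUANTIFIER_WEIGHT = {"?": 0.5, "+": 0.6, "*": 0.25, "+?": 0.7, "*?": 0.35}

def class_score(n):
    """f(n) for a class of n characters; 1/n is one of the suggested choices."""
    return 1.0 / n

def specificity(pattern):
    """Score a pattern from the small subset described in the lead-in."""
    score = 0.0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            atom = 1.0          # simplification: every escape counts as 1 point
            i += 2
        elif ch == "[":
            end = pattern.index("]", i + 1)   # raises ValueError if unclosed
            members = pattern[i + 1:end]
            if not members:
                raise ValueError("empty character class is not supported")
            atom = class_score(len(members))  # ranges like a-z are not expanded
            i = end + 1
        elif ch == ".":
            atom = 0.0          # assumption: a bare wildcard earns nothing
            i += 1
        else:
            atom = 1.0          # plain literal (or an anchor such as ^ or $)
            i += 1
        # Apply a trailing quantifier, trying the two-character forms first.
        if pattern[i:i + 2] in QUANTIFIER_WEIGHT:
            atom *= QUANTIFIER_WEIGHT[pattern[i:i + 2]]
            i += 2
        elif pattern[i:i + 1] in QUANTIFIER_WEIGHT:
            atom *= QUANTIFIER_WEIGHT[pattern[i:i + 1]]
            i += 1
        score += atom
    return score

for p in ["abc", "ab*", "[abcd]x", "a.c"]:
    print(p, specificity(p))
```

A real version would want a proper regex parser rather than this character-by-character walk, plus some rule for alternation, which is left open here just as it is above.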

I'm not aware of a module to do this sort of thing, but that doesn't mean that there isn't one out there.
