I was afraid someone might ask what "short" and "long" mean in this context, as well as "a few" and "many". Those are ambiguous quantifiers. It reminds me of my economics classes, where professors talked about the "short run" and the "long run".

I think that the experts (such as Friedl, in MRE) intentionally avoid defining what is short, long, few, or many, and I won't second-guess their caution about setting thresholds. But I think it's safe to say that, in the context of study, a few thousand characters is pretty short. The only way to be sure is to benchmark it. And as I implied, bothering with study at all should be a last resort, after exhausting other design options.
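Here is a rough sketch of how one might benchmark it, using the core Benchmark module. The test data, the patterns, and the helper names (make_text, count_matches) are all made up for illustration, so substitute your own strings and regexps. Also note that since Perl 5.16 study has been a no-op, so on a modern perl the two rows should come out about the same.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: roughly half a megabyte of filler words with an
# occasional literal marker mixed in.
sub make_text {
    return join ' ', map { $_ % 1000 ? "word$_" : "MARKER$_" } 1 .. 50_000;
}

my $plain   = make_text();   # never studied
my $studied = make_text();   # studied on every timed call

# Hypothetical patterns, each anchored on literal text.
my @patterns = (qr/MARKER42000/, qr/word49999\b/, qr/nosuchliteral/);

# Run every pattern against the string and count how many match.
sub count_matches {
    my ($text_ref, $use_study) = @_;
    study $$text_ref if $use_study;   # the cost of study is part of the timing
    my $hits = 0;
    for my $re (@patterns) {
        $hits++ if $$text_ref =~ $re;
    }
    return $hits;
}

# Time each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    with_study    => sub { count_matches(\$studied, 1) },
    without_study => sub { count_matches(\$plain,   0) },
});
```

The study call is deliberately inside the timed code, since its setup cost is exactly what has to be paid back by the matches that follow.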

As for "literal text cognizance", the very next sentence defined it: "Without a known character that must appear in any match, study is useless." That is, unless your regexps are looking for literal text within the string (as opposed to containing only "wildcard" matches), study is useless.

In other words, if your RE contains ONLY "wildcard" matching constructs such as . \w \d \s \S \W \D, etc., and no literal text, you're wasting your time with study.

For example (warning: silly examples):

    m/\w+\b.?\d*$/;   # wouldn't benefit from study.
    m/abc/;           # may benefit from study.
    m/\d+abc\W.+/;    # may benefit from study.
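To put those patterns in context, here is a minimal sketch of the usage study is meant for: one call, followed by several matches against the same string. The sample string and the labels are invented for illustration, and a real candidate for study would be far longer than this.

```perl
use strict;
use warnings;

# Hypothetical input; in practice this would be a much longer string.
my $text = "item 42abc! trailing text";

# The three silly patterns, labelled by whether they contain literal text
# that study could look for.
my @checks = (
    [ 'wildcards only'             => qr/\w+\b.?\d*$/  ],
    [ 'literal abc'                => qr/abc/          ],
    [ 'literal abc plus wildcards' => qr/\d+abc\W.+/   ],
);

study $text;   # one study call, then several matches against the same string

my @matched;
for my $check (@checks) {
    my ($label, $re) = @$check;
    push @matched, $label if $text =~ $re;
}
print "matched: $_\n" for @matched;
```

Only the two patterns containing the literal abc give study a known character to look for; the first one matches here too, but it would get nothing out of the study call.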


Dave


"If I had my life to live over again, I'd be a plumber." -- Albert Einstein

Re: Re: Re: Re: Help on decide when study
by vacant (Pilgrim) on Nov 13, 2003 at 23:06 UTC
    Along with the disqualifying criteria already mentioned here, Mr. Friedl does give some useful approximate criteria in the "Owls" book. He describes a sufficiently long string as being "at least several kilobytes", and gives as an example of a useful application checking each chapter of his book, represented as a single string, for "mistaken markup" using a number of regexes on that same string (I presume; he describes it as "a bevy of checks"). He doesn't mention having done any comparative benchmarking of that operation, or how to guess how many regexes make it worthwhile.

    For my money, I suspect that the best rule of thumb is the one suggested by sauoq elsewhere in this discussion.