in reply to Re: Help on decide when study
in thread Help on decide when study

Dear davido
First of all, thank you for answering so fast. (:

Now, the points that are still obscure to me:

Don't use study when the target string is short.
I have several years of experience coding Perl, and this is still an obscure point to me: what should I consider a short string? I'm sure that 20 chars is short; that's obvious. But how about 400, or even 2000 chars? Is that short?

Don't use study when you plan only a few matches against the target string.
Again, I lack a precise criterion to rely on: what should I count as "a few matches"? I know for sure that one or two is obviously "a few". But how many more should I still consider "a few"?

Don't use study when Perl has no literal text cognizance for the regular expressions that you intend to benefit from the study. Without a known character that must appear in any match, study is useless.
Sorry, I don't know what "literal text cognizance" is. Can you please explain it to me? (Many thanks in advance!!)

Once more, thank you very much for your care and your answer, and thank you very much for sharing your knowledge.

May the gods bless you.


"In few words, translating PerlMonks documentation and best articles to other languages is like building a bridge to join other Perl communities into PerlMonks family. This makes the family bigger, the knowledge greather, the parties better and the life easier." -- monsieur_champs

Re: Re: Re: Help on decide when study
by davido (Cardinal) on Nov 13, 2003 at 19:35 UTC
    I was afraid someone might ask what "short" and "long" mean in this context, as well as "a few" and "many". Those are ambiguous quantifiers. It reminds me of my economics classes, when professors talked about the "short run" and the "long run".

    I think that the experts (such as Friedl, in MRE) intentionally don't try to define what is short, long, few, or many. I won't try to second-guess their caution about defining thresholds. But I think it's safe to say that in the context of study, a few thousand characters is pretty short. However, the only way to be sure is to benchmark it with your own strings and regexps. And as I implied, bothering with study at all should be a last resort, after exhausting other design options.
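A minimal sketch of such a benchmark, using the core Benchmark module. The target string, the patterns, and the helper name count_matches are all made up for illustration; substitute your own data. Note also that perlfunc documents study as a no-op as of Perl 5.16, so on a current perl the two rows should come out about the same.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: a target string of a few thousand characters
# with one literal "needle" at the very end.
my $text = ('lorem ipsum dolor sit amet ' x 150) . 'needle';

# Hypothetical patterns, each containing some literal text.
my @patterns = (qr/needle/, qr/dolor\s+sit/, qr/\d+abc\W.+/);

# Count how many patterns match, optionally calling study first.
sub count_matches {
    my ($use_study) = @_;
    my $copy = $text;              # fresh copy, so each run starts unstudied
    study $copy if $use_study;
    my $hits = 0;
    for my $re (@patterns) {
        $hits++ if $copy =~ $re;
    }
    return $hits;
}

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    with_study    => sub { count_matches(1) },
    without_study => sub { count_matches(0) },
});
```

The numbers only mean something for the string length and the number of matches you actually expect, so it is worth varying both before drawing a conclusion.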

    As for "literal text cognizance", the very next sentence defined it: "Without a known character that must appear in any match, study is useless." Literal text cognizance means that study is useless unless your regexps look for literal text within the string (as opposed to containing only "wildcard" matches).

    In other words, if your RE contains ONLY "wildcard" matching constructs such as . \w \d \s \S \W \D, etc., and doesn't contain literal text, you're wasting your time with study.

    For example (warning: silly examples):

    m/\w+\b.?\d*$/;    # wouldn't benefit from study.
    m/abc/;            # may benefit from study.
    m/\d+abc\W.+/;     # may benefit from study.


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
      Along with the disqualifying criteria already mentioned here, Mr. Friedl does give some useful approximate criteria in the "Owls" book. He describes a sufficiently long string as being "at least several kilobytes", and gives as an example of a useful application checking each chapter of his book, represented as a single string, for "mistaken markup" by running a number of REs against that same string. (I presume; he describes it as "a bevy of checks".) He doesn't mention having done any comparative benchmarking of that operation, or how to guess how many REs make it worthwhile.
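A rough sketch of the usage pattern described above: one large string, studied once, then checked with several regexps. This is not Mr. Friedl's code. The chapter text, the check names, and the patterns are invented for illustration, and on Perl 5.16 or later the study call is documented as a no-op.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one chapter held in a single string
# (roughly 18 KB, so "at least several kilobytes").
my $chapter = join '', map { "Paragraph $_ with <b>bold</b> text.\n" } 1 .. 500;
$chapter .= "A stray <i>italic tag that is never closed.\n";

# Hypothetical "bevy of checks"; every pattern contains literal text.
my %checks = (
    'unclosed <i>'  => qr/<i>(?![^\n]*<\/i>)/,   # <i> with no </i> on the same line
    'empty <b></b>' => qr/<b><\/b>/,
    'doubled word'  => qr/\bthe the\b/,
);

study $chapter;   # one study call, then many matches against the same string

# Collect the names of the checks that found something.
my @problems = grep { $chapter =~ $checks{$_} } sort keys %checks;
print "Possible mistaken markup: $_\n" for @problems;
```

The point is the shape of the workload: the cost of study is paid once per string, so it can only pay off when many literal-bearing patterns are run against that same long string.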

      For my money, I suspect that the best rule of thumb is the one suggested by sauoq elsewhere in this discussion.