in reply to Help on decide when study

Given that study's benefits are difficult to predict, it might be best to just benchmark with and without, to see which version is quicker.
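
Something like the following, using the core Benchmark module, is one way to run that comparison. It's only a sketch -- the string and the patterns are made-up placeholders, so substitute your real data:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Placeholder data: swap in your real string and regexes.
    my $text     = "filler text with a needle and a haystack in it " x 10_000;
    my @patterns = ( qr/needle/, qr/haystack/, qr/nowhere/ );

    cmpthese( -3, {
        without_study => sub {
            my $s = $text;                      # fresh copy each iteration
            for my $re (@patterns) {
                my $hits = () = $s =~ /$re/g;   # count the matches
            }
        },
        with_study => sub {
            my $s = $text;
            study $s;                           # analyze the string first
            for my $re (@patterns) {
                my $hits = () = $s =~ /$re/g;
            }
        },
    } );

The negative count asks Benchmark to run each sub for at least three CPU seconds and print a comparison table.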

The Owl book (Mastering Regular Expressions, 1st Edition) says:

Study is most useful when you are matching a large string many times and your regular expressions contain literal text that must be found within the string.

Study is also known to contain bugs in older versions of Perl, so use with caution.

The best advice I can give a novice is to ignore study and look for other ways to optimize your code. If you can't find any design solution that is fast enough for your needs, then try invoking study and benchmarking to see whether it helps. But in general, don't expect a miracle.


Dave


"If I had my life to live over again, I'd be a plumber." -- Albert Einstein

Re: Re: Help on decide when study
by monsieur_champs (Curate) on Nov 13, 2003 at 19:23 UTC

    Dear davido
    First of all, thank you for answering so fast. (:

    Now, the points that are still obscure to me:

    Don't use study when the target string is short.
    I have several years of experience coding Perl, and this is still an obscure point to me: what should I consider a short string? I'm sure that 20 chars is short -- that's obvious. But how about 400, or even 2000 chars? Is that short?

    Don't use study when you plan only a few matches against the target string.
    Again, I lack a precise criterion to rely on: what should I count as "a few matches"? I know for sure that one or two is obviously "a few". But how many more should I still consider "a few"?

    Don't use study when Perl has no literal text cognizance for the regular expressions that you intend to benefit from the study. Without a known character that must appear in any match, study is useless.
    Sorry, I don't know what "literal text cognizance" is; can you please explain it to me? (Many thanks in advance!!)

    Once more, thank you very much for your care and your answer, and for sharing your knowledge.

    May the gods bless you.


    "In few words, translating PerlMonks documentation and best articles to other languages is like building a bridge to join other Perl communities into PerlMonks family. This makes the family bigger, the knowledge greather, the parties better and the life easier." -- monsieur_champs

      I was afraid someone might ask what "short" and "long" are in this context, as well as "a few" and "many".... Those are ambiguous quantifiers. It reminds me of my economics classes, when professors talked about the "short run" and the "long run".

      I think that the experts (such as Friedl, in MRE) intentionally don't try to define what is short, long, few, or many. I won't try to second-guess their caution about defining thresholds. But I think it's safe to say that in the context of study, a few thousand characters is pretty short. However, the only way to be sure is to benchmark it. And as I implied, bothering with study at all should be a last resort, after exhausting other design options.

      As for "literal text cognizance", the very following sentence defined it: "Without a known character that must appear in any match, study is useless." Literal text cognizance means that unless your regexps are looking for literal text within the string (as opposed to only containing "wildcard" matches), study is useless.

      .....in other words, if your RE contains ONLY "wildcard" matching constructs such as ". \w \d \s \S \W \D", etc., and doesn't contain literal text, you're wasting your time with study.

      ...for example (warning; silly examples):

      m/\w+\b.?\d*$/;   # wouldn't benefit from study.
      m/abc/;           # may benefit from study.
      m/\d+abc\W.+/;    # may benefit from study.


      Dave


      "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
        Along with the disqualifying criteria already mentioned here, Mr. Friedl does give some useful approximate criteria in the "Owl" book. He describes a sufficiently long string as being "at least several kilobytes", and gives, as an example of a useful application, that of checking each chapter of his book, represented as a single string, for "mistaken markup" using a number of re's against that same string. (I presume. He describes it as "a bevy of checks".) He doesn't mention having done any comparative benchmarking of that operation, or how to guess how many re's make it worthwhile.
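
        I presume it looked something like the following sketch. The file name and the particular checks here are invented for illustration; Friedl doesn't show his actual "bevy of checks":

            use strict;
            use warnings;

            # Made-up markup checks standing in for Friedl's real ones.
            my @checks = (
                qr/<i>[^<]*<i>/,    # open-italic followed by another open-italic
                qr/ {2,}/,          # runs of multiple spaces
                qr/\bteh\b/,        # a common typo
            );

            open my $fh, '<', 'chapter1.txt' or die "chapter1.txt: $!";
            my $chapter = do { local $/; <$fh> };    # slurp the whole chapter
            close $fh;

            study $chapter;    # pay the analysis cost once for the big string
            for my $re (@checks) {
                print "possible mistaken markup: $re\n" if $chapter =~ $re;
            }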

        For my money, I suspect that the best rule of thumb is the one suggested by sauoq elsewhere in this discussion.