I think your approach is worth pursuing, but I have not seen anything like it so far.

The big difference between your proposal and the existing modules, if I understand you correctly, is that you want to create (e.g., with this "tagblock" syntax) a special-purpose regex-based parser that ignores nearly all of the document and just dives into the parts that match the input specification. This could of course be faster than a full parse of the whole document tree.
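To make the "ignore nearly everything" idea concrete, here is a minimal sketch in Python (chosen only for illustration; the same shape works in Perl). The function name `extract_blocks` and the sample page are hypothetical, not part of your proposal, and the pattern deliberately does not handle nested tags of the same name.

```python
import re

def extract_blocks(html, tag):
    """Return the inner text of every <tag>...</tag> block, skipping all else."""
    # Non-greedy body match; DOTALL lets a block span several lines.
    # \b keeps "p" from matching "<pre>". Nested same-name tags are NOT handled.
    pattern = re.compile(
        r"<{0}\b[^>]*>(.*?)</{0}\s*>".format(re.escape(tag)),
        re.IGNORECASE | re.DOTALL,
    )
    return [m.group(1).strip() for m in pattern.finditer(html)]

# Hypothetical sample document: only the <p> blocks are of interest.
page = ("<html><body><h1>News</h1><p>first</p>"
        "<div>noise</div><P class='x'>second</P></body></html>")
print(extract_blocks(page, "p"))  # ['first', 'second']
```

Nothing outside the requested blocks is ever tokenized, which is where the speed advantage over building a full document tree would come from.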

The main point here is: what will be the grammar for your input specification syntax? If you could post some proposal for it in a more formal notation (e.g., BNF), it would be interesting to work on this problem.

From a performance point of view, the compiled specification should almost certainly not be one big regex (lots of alternation etc. will make it slow, and the "Little Engine That Couldn't" problem looms, too), but a closure that performs the needed logical tests in concert with the generated regexes.
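A rough sketch of what I mean by "a closure plus small regexes", again in Python purely for illustration. The names `compile_spec`, `must_contain`, and `must_not_contain` are assumptions of mine; your specification syntax would decide what the real inputs look like.

```python
import re

def compile_spec(tag, must_contain=None, must_not_contain=None):
    """Build a matcher closure from a (hypothetical) specification."""
    # One small regex per job instead of a single alternation-heavy monster.
    block_re = re.compile(
        r"<{0}\b[^>]*>(.*?)</{0}\s*>".format(re.escape(tag)),
        re.IGNORECASE | re.DOTALL,
    )
    want_re = re.compile(must_contain) if must_contain else None
    reject_re = re.compile(must_not_contain) if must_not_contain else None

    def matcher(html):
        hits = []
        for m in block_re.finditer(html):
            body = m.group(1)
            # The logical tests live in ordinary code, not inside the regex.
            if want_re and not want_re.search(body):
                continue
            if reject_re and reject_re.search(body):
                continue
            hits.append(body.strip())
        return hits

    return matcher

# Hypothetical usage: table cells that hold a price and are not sold out.
find_prices = compile_spec("td", must_contain=r"\$\d+",
                           must_not_contain=r"(?i)sold out")
row = "<tr><td>Widget</td><td>$12</td><td>$99 (sold out)</td></tr>"
print(find_prices(row))  # ['$12']
```

Each regex stays simple and is compiled once when the specification is compiled; the and/not logic runs as plain code, so no backtracking is spent on conditions a single combined pattern would have to encode.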

The real benefit of your approach, of course, is that people who could not otherwise use regexes at all would be able to do some quite sophisticated regex matching. Maybe this warrants some searching, too: are there any "user friendly" regex specification packages out there?

Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com


In reply to RE: A grammar for HTML matching by clemburg
in thread A grammar for HTML matching by mcelrath
