nothingmuch has asked for the wisdom of the Perl Monks concerning the following question:

I have a funny idea I want to implement, but I don't really know how to start implement it.

It involves taking matches against tree data (that's not too hard to define, it's just regexes with different primitives), and evaluating them on points in the tree (a bit harder?) but not as matches, but as a grammer. The whole thing must be a true parsing solution, that is able to give back twigs of the tree that have matched. This involves backtracking, and nesting, and so on.

On the CPAN there's Data::Match, which seems primitive and unflexible, and a plethora of parsing solutions, which seem very potent, but suited towards strings. For the matching part, I might try to base around Data::Compare which I know and like, but as for the parsing I really don't know...

Is anyone familiar with a parser module to an extent where they can say whether the code can be hijacked to the point that the matching itself, aswell as the input gathering is easily replaceable (that is, the atoms of the grammer and what they are willing to match is my own concern), without having to reimplement the rule composition, backtracking, permuting, and reporting?

Lastly, I would like to be able to prune the search tree based on heuristics. The data can be matched in many ways, i'm concerned about finding one which is good enough, given a user threshold.

-nuffin
zz zZ Z Z #!perl

Replies are listed 'Best First'.
Re: Parsing a tree instead of a string?
by CountZero (Bishop) on Jan 22, 2005 at 13:15 UTC
    Just a wild thought: did you think of XML, XSLT, XPATH and such?

    The data model provides a tree representation of XML documents as well as atomic values such as integers, strings, and booleans, and sequences that may contain both references to nodes in an XML document and atomic values. The result of an XPath expression may be a selection of nodes from the input documents, or an atomic value, or more generally, any sequence allowed by the data model. The name of the language derives from its most distinctive feature, the path expression, which provides a means of hierarchic addressing of the nodes in an XML tree (found here)

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      XPath seems like a useful implementation for the patterns that the rules match. It's rather general and powerful, and I might eventually use it, but it has two problems, that cocnern me where I am now, and provides, as I see it, no solution for either:
      • XPath is not a grammer, it's a way to write patterns, so it will need to be part of a larger scheme
      • XPath will be very hard to performance-tune later. I will probably need to prune the very wide match tree to find the results I'm most happy with in a timely manner. I doubt XPath will let me do it since it's possibly a bit too powerful.
      The second problem is solvable provided I can ask per node whether rule $some_xpath_based_rule applies to it, and then memoize the results, but trying to find a good match by pruning, like game-playing searches do is not very trivial since it is rather closed.

      I think I may eventually use a subset of XPath, where some of it's aspects/features are actually pushed up to the grammer engine level, and at that point I might have something useful.

      Anyway, my answer is "yes, i've thought of that, but I'm not really sure what I'm going to do about those thoughts yet" ;-).

      Thanks for your time!

      -nuffin
      zz zZ Z Z #!perl
Re: Parsing a tree instead of a string?
by mirod (Canon) on Jan 22, 2005 at 14:17 UTC

    XML::Twig has a method that might implement a subset of what you are looking for: wrap_children. Look for it in the code and in the docs. It lets you give a regexp on the children of an element, for exemple $doc->wrap_children( '<title><para>*', 'section') will wrap a title followed by para's in a section element.

    At least it will give you one possible way of solving your problem: serialize the tree (in the case of wrap_children not all the tree is serialized, just the start tags of children), and then buid a regular expression that you can apply to that serialized tree. This way you don't have to re-invent the regexp engine wheel, which frankly would scare me!

Re: Parsing a tree instead of a string?
by ihb (Deacon) on Jan 22, 2005 at 16:55 UTC

    According to Ovid who has looked into the different logic programming modules on CPAN it doesn't seem like any of them do the job that well. Otherwise I'd recommend you to look into logic programming. Ovid has written some nodes about it, perhaps you want to look at Bringing Logic Programming to Perl. One of Prolog 's problem domains is grammar parsing, so building trees and matching against them is a job it does well. Perhaps you can even utilize an external Prolog program in your Perl program?

    ihb

    See perltoc if you don't know which perldoc to read!
    Read argumentation in its context!

      This, and as Corion mentioned to me on the CB, lisp's pattern matching, seems like the way to go.

      It also seems like the most interesting approach too, but the problem is that it might take long to learn everything. Oh well, this is just a hobby project, and since I can't devote too much time to it anyway, it's taking long already. ;-)

      -nuffin
      zz zZ Z Z #!perl
Re: Parsing a tree instead of a string?
by ysth (Canon) on Jan 23, 2005 at 09:33 UTC
    hmmm...

    overload opendir, readdir, and stat, and use File::Find?

    hmmm...

      I guess that can be used for simple pattern matching, but i need more... Backtracking, nesting of patterns, etc.

      I also guess that the Tree:: family of modules has some implementation akin to File::Find's traversal and extraction.

      Nice idea though ;-)

      -nuffin
      zz zZ Z Z #!perl