Parsing to a format-neutral document model?

xdg has asked for the wisdom of the Perl Monks concerning the following question:

I'm brainstorming some ideas around a project involving translating various structured, text-markup languages to and from each other. (E.g. HTML, XML, Wiki, POD, RSS, etc.) I'm pretty much new to this kind of document/data parsing and would like to get some pointers before I dive headlong into it.

As I've browsed many existing modules on CPAN around these topics, my impression is that many of them are centered around a particular format (e.g. XML::Parser, XML::DOM), either parsing that format or converting an existing parse tree in that format to various other formats, or both. The other thing I've observed is that several parsers/modules are tightly coupled to a larger project, e.g. a full wiki installation (as with Kwiki::Formatter) or a broader library (as with libxml-perl).

I'd like to find something a little more open-ended and general-purpose. Conceptually, I'm thinking about the problem in three parts:

A parser based on a particular grammar for a type of markup or markup dialect
A Perl data structure that encapsulates the document structure
A translator that converts from the Perl data structure to any particular desired output
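
To make the three-part split concrete, here is a minimal sketch in core Perl. Everything in it is an assumption for illustration: the toy wiki dialect (a block starting with "=" is a heading, blank lines separate paragraphs), the node layout (hashes with `type` and `text`), and the names `parse_toy_wiki` and `render`. It is not taken from any CPAN module.

```perl
use strict;
use warnings;

# 1. Parser: toy wiki dialect -> format-neutral tree.
#    Hypothetical dialect: "= Title" is a heading; blank lines separate paragraphs.
sub parse_toy_wiki {
    my ($text) = @_;
    my @nodes;
    for my $block ( split /\n{2,}/, $text ) {
        $block =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next unless length $block;
        if ( $block =~ /^=\s*(.+)$/ ) {
            push @nodes, { type => 'heading', text => $1 };
        }
        else {
            ( my $para = $block ) =~ s/\s*\n\s*/ /g;    # unwrap hard line breaks
            push @nodes, { type => 'para', text => $para };
        }
    }

    # 2. Data structure: plain nested hashes and arrays, nothing format-specific.
    return { type => 'document', children => \@nodes };
}

# 3. Translators: one small handler per node type, per output format.
#    (No escaping is done here; a real HTML emitter would need it.)
my %emit = (
    pod => {
        heading => sub { "=head1 $_[0]{text}\n\n" },
        para    => sub { "$_[0]{text}\n\n" },
    },
    html => {
        heading => sub { "<h1>$_[0]{text}</h1>\n" },
        para    => sub { "<p>$_[0]{text}</p>\n" },
    },
);

sub render {
    my ( $doc, $format ) = @_;
    my $handlers = $emit{$format} or die "unknown format: $format";
    return join '', map { $handlers->{ $_->{type} }->($_) } @{ $doc->{children} };
}

my $doc = parse_toy_wiki("= Name\n\nFirst line\nsecond line.\n");
print render( $doc, 'pod' );
print render( $doc, 'html' );
```

In this arrangement, supporting a new input dialect means writing another parser that returns the same tree shape, and supporting a new output means adding another entry to `%emit`. Neither side needs to know about the other.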

For the parser, I'm not sure whether a bottom-up or top-down approach is best. I'm somewhat tempted to use this as an opportunity to learn Parse::RecDescent, which seems like it would be effective for these kinds of documents with sections, paragraphs, inline markup, etc. Are there other suggestions? I'd like to avoid external, non-Perl tools, but something like Parse::YAPP could be ok.

For the data structure, I'm debating between converting everything to some standard DOM (e.g. XML::DOM, Mozilla::DOM) or equivalent "grove" (e.g. Data::Grove and the like) or rolling my own generic document tree structure using tools like Tree::Simple or Data::Hierarchy. A standards-based approach seems appealing to be able to leverage tools built on the standard, but I'm worried about a lack of flexibility and burdening the dependency chain with a DOM written for too narrow a purpose. (E.g. XML::DOM requires LWP::UserAgent and also XML::Parser which itself depends on the "expat" library.)

For the translator, the approach pretty much depends on the data structure. If it winds up in a standards-based structure, then I can leverage tools to manipulate that standard. Otherwise, the output formatting would have to be written based on traversal of the data structure. (This assumes a document model approach as opposed to a SAX-style approach.)
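
The difference between the two approaches can be sketched with a hand-rolled tree. The node layout (hashes with `type` and `children`, plain strings as text leaves) and the names `to_pod` and `walk_events` are hypothetical. The event version below is driven from an existing tree only to keep the example short; a real SAX-style parser would fire these callbacks while reading the input and never build the tree at all.

```perl
use strict;
use warnings;

# Hypothetical nested tree: nodes are hashes, text leaves are plain strings.
my $tree = {
    type     => 'para',
    children => [
        'Use ',
        { type => 'code', children => ['Parse::RecDescent'] },
        ' with ',
        { type => 'bold', children => ['care'] },
        '.',
    ],
};

# Opening and closing POD text for each node type.
my %pod_wrap = (
    para => [ '',   "\n\n" ],
    code => [ 'C<', '>' ],
    bold => [ 'B<', '>' ],
);

# Document-model style: the whole tree exists first, then we recurse over it.
sub to_pod {
    my ($node) = @_;
    return $node unless ref $node;    # text leaf
    my ( $open, $close ) = @{ $pod_wrap{ $node->{type} } || [ '', '' ] };
    return $open . join( '', map { to_pod($_) } @{ $node->{children} } ) . $close;
}

# Event style: the consumer only ever sees start/text/end callbacks.
sub walk_events {
    my ( $node, $h ) = @_;
    return $h->{text}->($node) unless ref $node;
    $h->{start}->( $node->{type} );
    walk_events( $_, $h ) for @{ $node->{children} };
    $h->{end}->( $node->{type} );
}

my $out = '';
walk_events(
    $tree,
    {
        start => sub { $out .= $pod_wrap{ $_[0] }[0] },
        text  => sub { $out .= $_[0] },
        end   => sub { $out .= $pod_wrap{ $_[0] }[1] },
    }
);

print to_pod($tree);
print $out;
```

Both routes produce the same POD here. The tree version lets a translator look ahead or rearrange nodes before emitting anything, while the event version has to work with whatever arrives next.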

I'm looking for your general thoughts about how I'm structuring this problem, or how you've tackled similar problems. Am I over-engineering a solution? Should I just look to use existing CPAN modules? Either way, I'm also looking for any recommendations for modules that you have found particularly helpful at tackling these kinds of parsing and document model situations. What kinds of lessons learned can you share that would get me started on the right foot?

Thanks very much,

-xdg

Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: Parsing to a format-neutral document model? by dragonchild (Archbishop) on Sep 12, 2005 at 20:01 UTC
What problem are you trying to solve? What system(s) do you expect to be communicating with? From reading your post, it sounds like you're looking to provide a solution and are looking for the problem to solve.

My experience has usually been with taking data in formats that I don't choose, parsing them into a format my code understands, doing something with it, then emitting the results back out in formats that I don't choose. The fact that I can do that easily with Perl + CPAN has been a great strength for my employability.

Data is organized in formats for a number of different reasons, and the format is chosen to enable the code using that data to do its job more efficiently. For example, you wouldn't organize something in a Word document if your primary purpose is CRUD. But, the Word format is more useful for WYSIWYG editing than a relational schema.

Just a few thoughts ...

My criteria for good software:
Does it work?
Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re^2: Parsing to a format-neutral document model? by xdg (Monsignor) on Sep 12, 2005 at 20:58 UTC
Good questions. I phrased it as a "brainstorm" as that's where I'm currently at. I've got some ideas of what I want to do and I'm trying to figure out which of the many ways to do it I want to start with. Some of my goal is educational -- so expediency of solution isn't a top priority. (A rare luxury.)

The genesis of my question came from the "documentation that doesn't suck" thread, which reminded me of a halfway-started, never-completed project of mine from a year or two ago. What I'd like to do is replace my module's POD with some sort of wikitext, wrapped in `=begin wiki` / `=end wiki` blocks, and then pre-process those blocks during the module build process to create separate .pod files containing matching pod. It's similar to what ingy has been thinking about for Perldoc and Kwid -- only I'm not sure I'm willing to wait until that's done and documented.

In evaluating CPAN, I can find modules for wiki-to-HTML (though often tightly coupled), for pod-to-wiki, for html-to-pod, and many others that are less well documented and harder to sort through. The "easy" approach is to string together a wiki-to-html processor and an html-to-pod processor, but that makes the output dependent on the chain of tools and their idiosyncrasies. CPAN is great for getting something done and working, but doesn't always get it done exactly the way that you want.

It got me thinking about whether I should write my own narrowly focused wiki-to-pod translator, and that got me thinking about whether I should instead write the tool that I had really been hoping to find on CPAN: a generic wiki parser that could have various wiki grammars plugged into it and that spits out a document model that could be subsequently manipulated or turned into output.
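
The pre-processing step described above, pulling wikitext out of `=begin wiki` / `=end wiki` regions and emitting POD, might look roughly like the following sketch. The wiki dialect is invented for the example ("! Title" for a heading, `*word*` for bold), and `toy_wiki_to_pod` and `extract_wiki_as_pod` are hypothetical names. Writing the result to a .pod file during the build is left out.

```perl
use strict;
use warnings;

# Hypothetical converter for a toy dialect: "! Title" becomes a =head1
# and *bold* becomes B<bold>. A real wiki grammar would need far more.
sub toy_wiki_to_pod {
    my ($wiki) = @_;
    my @out;
    for my $line ( split /\n/, $wiki ) {
        if ( $line =~ /^!\s*(.+)$/ ) {
            push @out, "=head1 $1";
        }
        else {
            $line =~ s/\*([^*]+)\*/B<$1>/g;
            push @out, $line;
        }
    }
    return join( "\n", @out ) . "\n";
}

# Pull every "=begin wiki" ... "=end wiki" region out of module source
# and return the converted POD for all of them.
sub extract_wiki_as_pod {
    my ($source) = @_;
    my @blocks = $source =~ /^=begin\s+wiki[ \t]*\n(.*?)^=end\s+wiki[ \t]*$/msg;
    return join "\n", map { toy_wiki_to_pod($_) } @blocks;
}

# Sample module source, built line by line so that this script itself
# contains no line that starts with a POD directive.
my $source = join "\n",
    'package My::Module;',
    '',
    '=begin wiki',
    '',
    '! NAME',
    '',
    'My::Module does *one* thing.',
    '',
    '=end wiki',
    '',
    '=cut',
    '',
    '1;',
    '';

print extract_wiki_as_pod($source);
```

Because POD formatters skip `=begin` regions they do not recognize, the wikitext can sit in the module without confusing existing POD tools, and only the build step needs to understand it.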
If I get around to tackling it, I'd probably start with simple, existing tools that got the job done even if it wasn't exactly what I wanted (code development being so darned personal) and work out from there, but I was hoping to get more general insights into whether I was even thinking about the longer-term approach to this kind of parsing problem in the right way.

Does that clarify? I asked more vaguely first because I'm more interested in the general insights than the solution to the narrow problem, for which I'm confident I can cobble a solution. Probably I should have explained this in the first place.

-xdg