I'm brainstorming some ideas around a project involving translating various structured, text-markup languages to and from each other. (E.g. HTML, XML, Wiki, POD, RSS, etc.) I'm pretty much new to this kind of document/data parsing and would like to get some pointers before I dive headlong into it.

As I've browsed existing modules on CPAN around these topics, my impression is that many of them are centered on a particular format (e.g. XML::Parser, XML::DOM), either parsing that format, converting an existing parse tree in that format to various other formats, or both. The other thing I've observed is that several parsers/modules are tightly coupled to a larger project, e.g. a full wiki installation as with Kwiki::Formatter, or a broader library as with libxml-perl.

I'd like to find something a little more open-ended and general-purpose. Conceptually, I'm thinking about the problem in three parts:

A parser based on a particular grammar for a type of markup or markup dialect
A Perl data structure that encapsulates the document structure
A translator that converts from the Perl data structure to any particular desired output
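
To make the three parts concrete, here is a rough sketch of how they might fit together. Everything in it is made up for illustration: the toy markup syntax (a block starting with "= " is a heading, other blank-line-separated blocks are paragraphs), the node layout, and the names parse_toy, to_html, and escape_html. It uses only core Perl and is not meant as a real parser for any existing format.

```perl
use strict;
use warnings;

# Part 1: a parser for a hypothetical toy markup.
# Blocks are separated by blank lines; "= Title" is a heading,
# anything else is a paragraph.
sub parse_toy {
    my ($text) = @_;
    my @children;
    for my $block ( split /\n{2,}/, $text ) {
        $block =~ s/\A\s+|\s+\z//g;      # trim the block
        next unless length $block;
        if ( $block =~ /\A=\s+(.+)\z/s ) {
            push @children, { type => 'heading', text => $1 };
        }
        else {
            $block =~ s/\s*\n\s*/ /g;    # join wrapped lines
            push @children, { type => 'para', text => $block };
        }
    }
    # Part 2: the format-neutral data structure (plain hashes and arrays).
    return { type => 'document', children => \@children };
}

# Escape the characters that are special in HTML text.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Part 3: one possible translator, walking the structure to emit HTML.
sub to_html {
    my ($doc) = @_;
    my %tag = ( heading => 'h1', para => 'p' );
    my $out = '';
    for my $node ( @{ $doc->{children} } ) {
        my $t = $tag{ $node->{type} };
        $out .= "<$t>" . escape_html( $node->{text} ) . "</$t>\n";
    }
    return $out;
}

print to_html( parse_toy("= Hello\n\nFirst line\nsecond line\n") );
```

The point of the sketch is only that the parser and the translator never talk to each other directly; they share nothing but the intermediate structure, so a second input format or a second output format would each be one more function.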

For the parser, I'm not sure whether a bottom-up or top-down approach is best. I'm somewhat tempted to use this as an opportunity to learn Parse::RecDescent, which seems like it would be effective for these kinds of documents with sections, paragraphs, inline markup, etc. Are there other suggestions? I'd like to avoid external, non-Perl tools, but something like Parse::YAPP could be ok.

For the data structure, I'm debating between converting everything to some standard DOM (e.g. XML::DOM, Mozilla::DOM) or an equivalent "grove" (e.g. Data::Grove and the like), or rolling my own generic document tree structure using tools like Tree::Simple or Data::Hierarchy. A standards-based approach is appealing because I could leverage tools built on the standard, but I'm worried about a lack of flexibility and about burdening the dependency chain with a DOM written for too narrow a purpose. (E.g. XML::DOM requires LWP::UserAgent and also XML::Parser, which itself depends on the "expat" library.)

For the translator, the approach pretty much depends on the data structure. If it winds up in a standards-based structure, then I can leverage tools to manipulate that standard. Otherwise, the output formatting would have to be written based on traversal of the data structure. (This assumes a document model approach as opposed to a SAX-style approach.)
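
As a sketch of the traversal-based option, one possibility is a single recursive walker plus one dispatch table per output format. The node layout here (element nodes as hashes with type and children, text as plain strings), the node types, and the names render, %html, and %pod are all assumptions made for illustration, not taken from any existing module. The POD side is deliberately naive and does no escaping.

```perl
use strict;
use warnings;

# A hand-built tree in a hypothetical layout:
# element = { type => ..., children => [...] }, text = plain string.
my $doc = {
    type     => 'document',
    children => [
        {   type     => 'para',
            children => [
                'Some ',
                { type => 'bold', children => ['bold'] },
                ' text.',
            ],
        },
    ],
};

# One dispatch table per output format. Each handler receives the
# already-rendered content of its node and wraps it.
my %html = (
    document => sub { $_[0] },
    para     => sub { "<p>$_[0]</p>\n" },
    bold     => sub { "<b>$_[0]</b>" },
    text     => sub {
        my $s = shift;
        $s =~ s/&/&amp;/g;
        $s =~ s/</&lt;/g;
        $s =~ s/>/&gt;/g;
        return $s;
    },
);

my %pod = (
    document => sub { $_[0] },
    para     => sub { "$_[0]\n\n" },
    bold     => sub { "B<$_[0]>" },
    text     => sub { $_[0] },    # simplification: no POD escaping
);

# Depth-first traversal: render the children first, then hand the
# result to the handler for this node's type.
sub render {
    my ( $node, $handlers ) = @_;
    return $handlers->{text}->($node) unless ref $node;
    my $handler = $handlers->{ $node->{type} }
        or die "No handler for node type '$node->{type}'";
    my $inner = join '',
        map { render( $_, $handlers ) } @{ $node->{children} || [] };
    return $handler->($inner);
}

print render( $doc, \%html );
print render( $doc, \%pod );
```

With this shape, adding an output format means writing another table rather than another traversal, which is roughly the trade-off against reusing tools built for a standard DOM.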

I'm looking for your general thoughts about how I'm structuring this problem, or how you've tackled similar problems. Am I over-engineering a solution? Should I just look to use existing CPAN modules? Either way, I'm also looking for any recommendations for modules that you have found particularly helpful at tackling these kinds of parsing and document model situations. What kinds of lessons learned can you share that would get me started on the right foot?

Thanks very much,

-xdg

Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

In reply to Parsing to a format-neutral document model? by xdg
