Parse::RecDescent grammar for RTF

Courage has asked for the wisdom of the Perl Monks concerning the following question:

Re: Parse::RecDescent grammar for RTF by Willard B. Trophy (Hermit) on Nov 02, 2002 at 13:50 UTC
I don't know of an RTF grammar for Parse::RecDescent. RTF is a bit of a mess, structurally, so parsing isn't trivial. Even major applications write RTF that isn't quite standard. I, too, have been burnt by the RTF parser on CPAN (RTF::Parser); it nearly did what I wanted, but is exceedingly hard to customize. Basically, if RTF::Parser doesn't do exactly what you want out of the box (and it can; its HTML output is pretty cool), look elsewhere. Thus speaks the voice of bitter experience. A low-level solution which works for me is RTF::Tokenizer, on which I've based a production system for converting RTF dictionary data to QuarkXPress tags. RTF::Tokenizer has its quirks; give me a yell if you need help.
-- $,="\n";foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc")) {$a++;$_=unpack('B8',$_);tr,01,\40#,;$b[$a%6].=$_};print@b,"\n"
Re: Parse::RecDescent grammar for RTF by PodMaster (Abbot) on Nov 02, 2002 at 15:15 UTC
How's your C knowledge? You can find the RTF specification at http://www.wotsit.org/ (as you can for almost any file format a programmer might need help with), and it includes a sample C reader (Appendix A). It doesn't look like it'd be too hard to develop a grammar, although it looks like it'd be a lot easier to just develop a parser ;) Anybody looking for a project this smells like a good one. Update: If you're looking for strategy, try looking at a LaTeX parser, because LaTeX and RTF look very similar if you ask me. I'm surprised there isn't an open-source library already out there to do this (I know there is a non-free one that looks like it'd be useful).
The Third rule of perl club is a statement of fact: pod is sexy.
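To make the "just develop a parser" idea concrete, here is a minimal sketch of a hand-written recursive-descent reader in core Perl. It is not taken from the spec's Appendix A reader or from any CPAN module; the sub names (rtf_tokens, parse_group, parse_rtf, plain_text) are made up for illustration. It assumes well-formed input and ignores things a real reader must handle, such as \bin data, \uN fallback characters, and destinations like the font table, whose text it would happily treat as document text.

```perl
use strict;
use warnings;

# Split raw RTF into tokens of the form [type, value, parameter].
sub rtf_tokens {
    my ($rtf) = @_;
    my @tokens;
    pos($rtf) = 0;
    while (pos($rtf) < length $rtf) {
        if    ($rtf =~ /\G\{/gc) { push @tokens, ['open'] }
        elsif ($rtf =~ /\G\}/gc) { push @tokens, ['close'] }
        # Control word: letters, optional signed number, optional one-space delimiter.
        elsif ($rtf =~ /\G\\([A-Za-z]+)(-?\d+)? ?/gc) { push @tokens, ['control', $1, $2] }
        # Hex escape such as \'e9.
        elsif ($rtf =~ /\G\\'([0-9A-Fa-f]{2})/gc) { push @tokens, ['control', "'", $1] }
        # Control symbol: backslash plus one non-letter (\~, \\, \{, \}).
        elsif ($rtf =~ /\G\\([^A-Za-z])/gcs) { push @tokens, ['control', $1, undef] }
        # Raw CR/LF in the file are skipped.
        elsif ($rtf =~ /\G[\r\n]+/gc) { }
        elsif ($rtf =~ /\G([^\\{}\r\n]+)/gc) { push @tokens, ['text', $1] }
        else  { die "Unexpected input at offset " . pos($rtf) . "\n" }
    }
    return \@tokens;
}

# Recursive descent: collect items until the '}' that closes this group.
# $i is a reference to the shared read position in the token list.
sub parse_group {
    my ($tokens, $i) = @_;
    my @children;
    while ($$i < @$tokens) {
        my $tok = $tokens->[ $$i++ ];
        if    ($tok->[0] eq 'open')  { push @children, parse_group($tokens, $i) }
        elsif ($tok->[0] eq 'close') { return { group => \@children } }
        else                         { push @children, $tok }
    }
    die "Unbalanced braces: missing '}'\n";
}

# Parse a whole document; anything after the outermost '}' is ignored.
sub parse_rtf {
    my ($rtf) = @_;
    my $tokens = rtf_tokens($rtf);
    die "RTF must start with '{'\n"
        unless @$tokens && $tokens->[0][0] eq 'open';
    my $i = 1;
    return parse_group($tokens, \$i);
}

# Concatenate text tokens, ignoring every control word (a deliberate oversimplification).
sub plain_text {
    my ($node) = @_;
    return join '', map {
          ref $_ eq 'HASH'  ? plain_text($_)
        : $_->[0] eq 'text' ? $_->[1]
        :                     ''
    } @{ $node->{group} };
}

my $tree = parse_rtf('{\rtf1\ansi{\b Hello}, world\par}');
print plain_text($tree), "\n";
```

The tree is just nested hashes and arrays, so customizing the output means walking it with your own sub rather than fighting a callback API. Whether this stays simple once real-world RTF shows up is another matter.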
Re: Re: Parse::RecDescent grammar for RTF by Willard B. Trophy (Hermit) on Nov 02, 2002 at 21:43 UTC
I'd revise your statement:
> Anybody looking for a project this smells like a good one
to: Anybody looking for a project, this smells.
RTF is an unpleasant format. The basic spec might be okay for creating a writer, but creating a reader that will handle arbitrary RTF is quite hard. Thus speaks someone who has just spent the last five months dealing with RTF parsing.
Re: Re: Parse::RecDescent grammar for RTF by Courage (Parson) on Nov 03, 2002 at 09:47 UTC
Thank you for a very interesting link; I'll save it for the future. As it looks like a ready-to-use grammar does not currently exist, I'll try writing one myself and will post it on this site. However, I expect it to be extremely slow at parsing. I'll let you know about my results.
Courage, the Cowardly Dog
Re: Parse::RecDescent grammar for RTF by graff (Chancellor) on Nov 02, 2002 at 13:48 UTC
Having just installed RTF::Parser on my Linux laptop and read the README for that module, I don't think it's in a usable state -- e.g. there doesn't appear to be any documentation on how to use it. (There was no explicit mention of how old it is either, but the downloaded files under .cpan/build date from July 1999.) This looks like a dead-end, orphaned module. (Should CPAN have something like a garbage-collection process to clear away stuff like this?) On the other hand, RTF::Tokenizer seems to be quite current, and is documented. You didn't mention what you need to do, but maybe this module will be able to help you out.
Re: Re: Parse::RecDescent grammar for RTF by Willard B. Trophy (Hermit) on Nov 02, 2002 at 13:54 UTC
I've chatted with Pete Sergeant; I don't think that RTF::Tokenizer will be developed further. It is useful, though, if you are careful to preprocess special characters before running the tokenizer. It really, really doesn't like \~ codes for non-breaking spaces.
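A preprocessing pass of the kind described above could look something like the following sketch. The sub name replace_nbsp_symbol is invented for this example, and the default replacement \'a0 (the non-breaking space in Latin-1 and Windows-1252) is an assumption, not something stated in this thread; pass a different string as the second argument if your data or your downstream tool needs something else.

```perl
use strict;
use warnings;

# Replace the RTF non-breaking-space symbol \~ before handing the data to a tokenizer.
# The default replacement \'a0 is an assumption (NBSP in Latin-1/Windows-1252).
sub replace_nbsp_symbol {
    my ($rtf, $replacement) = @_;
    $replacement = q{\'a0} unless defined $replacement;
    # Match an escaped backslash first, so that "\\~" (a literal backslash
    # followed by a plain tilde) is left untouched.
    $rtf =~ s/(\\\\)|\\~/defined $1 ? $1 : $replacement/ge;
    return $rtf;
}

print replace_nbsp_symbol('{\rtf1 10\~km}'), "\n";
```

Matching the escaped backslash in the same substitution is the one design choice that matters here: a plain s/\\~/.../g would also rewrite the tail end of an escaped backslash followed by a tilde.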