swiftone has asked for the wisdom of the Perl Monks concerning the following question:

I've been given a task at work of putting a policy manual online, with full keyword searching and cross-linking. As part of the task, the client wants to be able to update the manual regularly and have the changes between versions shown in bold (the change markings from the previous version can then be discarded). When I mentioned RCS/CVS, they thought that sounded wonderful.
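Since only the latest round of changes needs to be bolded, the comparison itself can stay simple. A minimal pure-Perl sketch of the idea — a naive line-level comparison, not a real diff, and the function name and sample lines are made up for illustration:

```perl
#!/usr/bin/perl -w
use strict;

# Naive line-level "diff": any line in the new draft that did not
# appear verbatim in the old draft gets wrapped in <b>...</b>.
# A real pipeline would use diff(1) or Algorithm::Diff instead;
# this only shows the shape of the idea.
sub bold_changes {
    my ($old_lines, $new_lines) = @_;
    my %seen;
    $seen{$_} = 1 for @$old_lines;
    return map { $seen{$_} ? $_ : "<b>$_</b>" } @$new_lines;
}

my @old = ("Employees must badge in.", "Lunch is one hour.");
my @new = ("Employees must badge in.", "Lunch is 45 minutes.");
print "$_\n" for bold_changes(\@old, \@new);
```

Because the old markings are simply not carried forward, each run starts clean, which matches the "changes before that can be discarded" requirement.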

My recommendation to them was to convert their manual (currently thousands of pages in WordPerfect 5.2) to ASCII text, which would work in just about any word processor, now and in the future. It also works quite well with diff, cvs, and other text tools I have access to :) And of course, it can convert to HTML fairly cleanly.

This is all aided by the fact that the documents are written according to a fixed set of rules, which makes automatic conversion plausible.

So I am now looking at converting text to HTML. In particular, I need to be able to recognize nested outline format and table-like listings. HTML::FromText is nice, but doesn't support these well enough. I'm currently entertaining either using Parse::RecDescent or hacking up additions to HTML::FromText. Has anyone tried to do repeated automatic text-to-HTML conversion on this scale?
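For the nested-outline part, a hand-rolled pass may be enough before reaching for Parse::RecDescent. A rough sketch, assuming a fixed two-space indent per outline level — the client's actual numbering rules would drive this instead, and strictly speaking sublists belong inside an `<li>`, which is skipped here for brevity:

```perl
#!/usr/bin/perl -w
use strict;

# Turn an indentation-based outline into nested HTML lists.
# Assumes two spaces of indent per level (an assumption, not
# the client's real convention).
sub outline_to_html {
    my @lines = @_;
    my $depth = 0;
    my $html  = '';
    for my $line (@lines) {
        my ($indent, $text) = $line =~ /^(\s*)(.*)$/;
        my $level = length($indent) / 2 + 1;
        $html .= "<ul>\n"  while $depth < $level and ++$depth;
        $html .= "</ul>\n" while $depth > $level and --$depth;
        $html .= "<li>$text</li>\n";
    }
    $html .= "</ul>\n" while $depth-- > 0;   # close any open lists
    return $html;
}

print outline_to_html(
    "Policies",
    "  Leave",
    "    Sick leave",
    "  Conduct",
);
```

A grammar-based parser earns its keep once the table-like listings enter the picture; for pure outlines, tracking depth like this goes a long way.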

Replies are listed 'Best First'.
Re: Parsing Text
by chromatic (Archbishop) on Sep 12, 2000 at 23:01 UTC
    I'm looking at Webmake to manage a site. It involves changing text into HTML, and it ships with a couple of Perl modules that support that with very rudimentary markup in the text. I'm not sure how robust it is or how detailed your source data is, but it's strong enough for my purposes. You might have a look.
Re: Parsing Text
by ZZamboni (Curate) on Sep 13, 2000 at 02:55 UTC
    I recently found a program called text2html (one of several by that name, at least), which seems to work pretty well. I used it on much simpler text than what you describe, but that text included titles, bulleted paragraphs and the like, and it performed flawlessly. It may be at least a start for what you need. And it comes in module form, so you can just use it and call the appropriate subroutine.
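    For comparison, the kind of heuristics such a tool applies can be sketched in a few lines. These rules (underlined titles, `* ` bullets, blank-line-separated paragraphs) are illustrative guesses, not text2html's actual behavior or API:

```perl
#!/usr/bin/perl -w
use strict;

# Guess at block types the way txt2html-style tools do:
# an underlined line becomes a heading, "* " lines become a
# bulleted list, anything else becomes a paragraph.
sub simple_txt2html {
    my ($text) = @_;
    my @html;
    for my $chunk (split /\n{2,}/, $text) {
        if ($chunk =~ /^(.+)\n[=\-]+\s*$/) {        # underlined title
            push @html, "<h1>$1</h1>";
        } elsif ($chunk =~ /^\s*\*\s/) {            # bulleted block
            my @items = map { s/^\s*\*\s*//; "<li>$_</li>" }
                        split /\n/, $chunk;
            push @html, "<ul>\n" . join("\n", @items) . "\n</ul>";
        } else {                                    # plain paragraph
            push @html, "<p>$chunk</p>";
        }
    }
    return join "\n", @html;
}

print simple_txt2html("Leave Policy\n============\n\n* Sick leave\n* Vacation\n\nSee HR for details.\n");
```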

    --ZZamboni

Re: Parsing Text
by SuperCruncher (Pilgrim) on Sep 13, 2000 at 02:29 UTC
    swiftone, this sounds like a very interesting project. I'm sure I'm not the only monk that'd like to see the source when you get the script done.

    Converting legacy formats to modern formats such as HTML and XML is an area where I think Perl shines, and one that is only going to become more popular.

    BTW swiftone, I know exactly what you mean about the HTML generated by some programs. How many times have you seen junk like:

    <b></b><b></b><b></b><b></b><b></b><b></b>

    in auto-generated HTML? Stallman certainly had it right when he decided to classify certain machine-generated HTML as an "Opaque" format under the GNU Free Documentation License. I also imagine that using HTML generated by WordPerfect would make the cross-referencing harder. And would it even be possible to save the files under programmatic control? I doubt it.
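    Junk like that is at least easy to scrub after the fact. A small sketch (the function name is mine; the outer loop re-runs the substitution so nested empty pairs like `<b><i></i></b>` collapse too):

```perl
#!/usr/bin/perl -w
use strict;

# Repeatedly strip empty inline tag pairs such as <b></b> or <i> </i>.
# Looping until s///g makes no more substitutions handles nesting.
sub strip_empty_tags {
    my ($html) = @_;
    1 while $html =~ s{<(\w+)[^>]*>\s*</\1>}{}g;
    return $html;
}

print strip_empty_tags("<b></b><b></b>Some <b>real</b> text<i></i>\n");
```

This is only a band-aid for word-processor output, of course; it does nothing about the deeper cross-referencing problem.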

Re: Parsing Text
by extremely (Priest) on Sep 13, 2000 at 07:32 UTC

    If you are putting all the text into CVS anyway, maybe you should look into what the Mozilla folks are doing with Bonsai. It can interactively bring up two versions of a file side by side, and more. Pretty sexy, and I think it's all Perl and GPL'd.

    --
    $you = new YOU;
    honk() if $you->love(perl)

Re: Parsing Text
by runrig (Abbot) on Sep 12, 2000 at 22:59 UTC
    Can't you just open the WP files and save them as HTML?
      1. Conversion programs tend to do a really horrible job, producing nasty HTML. It's not entirely their fault, since they don't know what the content means, but it's generally true.
      2. That would not automatically indicate differences between drafts. (I don't know about WP's version control features, and the staff is on an older version anyway.)
      3. Your source format is still closed and vulnerable to change. They are trying to find a system that is a little more stable.