in reply to parse MS Word Template fields for legal documents

Use VB, not Perl. It sounds like that will make your life a lot easier as it has better integration into the Windows world. VBScript is quite powerful, at least for what you need.

------
We are the carpenters and bricklayers of the Information Age.

Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: Re: parse MS Word Template fields for legal documents
by dimar (Curate) on May 14, 2004 at 00:39 UTC

    Although there are reasons why you might want to stay away from perl for this type of application, you may actually have a very good case (no pun intended) for using perl, even with MSFT Word. Any experienced application developer would suggest you use a database for this type of thing, but you already said you don't have one. Therefore, although it's not necessarily the obvious choice, perl may be a very good fit for you (assuming you are competent with it).

    Consider these facts:

    TEMPLATING: Perl is probably the best package for developing an easily maintained, well designed templating solution. If you do not 'over engineer' it, you can produce good stuff that works quickly. Moreover, you have *much* better string manipulation and delimiting capabilities than with VB (string manipulation and quoting are among the biggest annoyances with VB), which plays into any 'fill in the blank' templating system.

    GUI INTERFACE: Pairing perl with a very easy 'front end' will almost certainly be a design requirement in order to keep the law office happy. They should not have to know that perl is at the 'guts' of your application. I would recommend using HTA, since it already leverages your knowledge of HTML, as opposed to MSFT Office and VBA. Unless you know VB and you don't mind being 'locked' into MSFT Office, steer away from VB.

    OFFICE HTML: Most people don't realize this, but you can use perl to easily spit out MSFT office documents by simply saving the documents as MSFT office HTML. This enables you to steer clear of the proprietary binary format while still maintaining the precise formatting that lawyers go nuts over. What this means is that you can build a data driven extensible application that does not require a backend database or any fancy conversion software to output MSFT compatible documents. Again, a backend database is good to have for this kind of thing, but not an absolute requirement.

    REPURPOSE YOUR PERL CODE: This also means that you can *repurpose* your code to output *anything* that supports text. For example, a lawyer will love you when you tell them that your document 'fill in' solution can also help track billable hours and send them to their timekeeping software. This will also win you brownie points for being a genius.
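
A minimal sketch of the 'fill in the blank' and 'repurpose' points above, not a finished design. The `{{name}}` marker syntax, the `fill` and `lookup` subs, and the client fields are all hypothetical names chosen for illustration. The same data hash fills both an HTML fragment and a tab-separated billing line.

```perl
use strict;
use warnings;

# Return the value for one blank, or die so a missing field is noticed
# before a half-filled letter goes out.
sub lookup {
    my ($data, $key) = @_;
    die "No value supplied for '$key'\n" unless exists $data->{$key};
    return $data->{$key};
}

# Replace every {{key}} marker in a template with the matching value.
sub fill {
    my ($template, $data) = @_;
    (my $out = $template) =~ s/\{\{(\w+)\}\}/lookup($data, $1)/ge;
    return $out;
}

# Hypothetical client record; in practice this would come from the front end.
my %client = (name => 'Jane Doe', matter => 'Estate planning', hours => '2.5');

# Single quotes keep perl from interpolating anything in the templates.
my $letter_html = '<p>Dear {{name}}, regarding your matter: {{matter}}</p>';
my $billing_row = '{{name}}' . "\t" . '{{matter}}' . "\t" . '{{hours}}';

print fill($letter_html, \%client), "\n";   # document output
print fill($billing_row, \%client), "\n";   # same data, timekeeping output
```

Dying on an unknown key is a deliberate choice here: for legal documents a loud failure is probably safer than a blank that silently stays empty.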

      I've tried the Office HTML approach (with Excel, not Word), and it works great, but there are a few limitations. One is that unless you learn MSFT's bizarre XML-ish syntax, you can't use many of the features of these applications (then again, maybe that's a good thing :). Second, though this may not apply to your case, it's hard to tell Office what the types of your data are, which can affect how they are displayed. Third, for especially large documents, it takes Office longer to process HTML than it does its native formats. If none of these apply to your situation (and they may not), then HTML is my suggestion too.

        Good point. These limitations that Errto mentions can indeed be a major pain in the *fill-the-blank*. Therefore, here is a quick 'step-by-step' guide that may save you a lot of wasted time.

        STEP: Open the 'form letter' MSFT WORD document with the blanks (aka open ClientIntakeFormFoo.doc)

        STEP: Use MSFT WORD to fill in the document with obviously bogus data (e.g. FAKE_FIRSTNAME, FAKE_LASTNAME, FAKE_FOO, FAKE_BAR)

        STEP: Save the filled in document as ClientIntakeFormFoo.htm in MSFT HTML

        STEP: Search thru the file you just saved for every instance of m/FAKE_[^\s]+/

        STEP: Replace the sections you found in the previous step with 'quotelike escapes' (e.g., dear ^.$NAME.q^, we are gonna sue you if you don't pay ^.$AMOUNT.q^ .)

        STEP: Enclose the entire html file with an 'outer quotelike': $sOutput = q^ DOCUMENT GOES HERE ^;

        STEP: Save the entire htm file as a perl module that you can use with your perl scripts, and you are basically done.

        Beware of all the limitations that Errto and others have mentioned, but this is a solution that should work well, because it saves you from having to learn the ugly and complicated MSFT markup. All you have to do is fill in your easily found 'blanks' and ignore the rest. Be sure to enclose your document with a single 'quotelike' (not doublequotes), so that perl does not accidentally interpolate anything that occurs inside your file, other than the 'quotelike escapes' that you supplied.
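
The steps above can be sketched roughly as follows. This is an untested-against-Word illustration: the tiny HTML string stands in for the much longer file Word actually saves, and `client_letter` is a hypothetical sub name. The search here uses `FAKE_\w+` rather than `FAKE_[^\s]+`, on the assumption that placeholders may sit right next to punctuation or tags, which `[^\s]+` would swallow. The `q^...^` quoting also assumes the saved HTML contains no literal `^`; if it does, pick another delimiter.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ClientIntakeFormFoo.htm as saved by Word.
my $saved_html = <<'END_HTML';
<html><body>
<p>Dear FAKE_NAME,</p>
<p>Our records show an unpaid balance of FAKE_AMOUNT.</p>
</body></html>
END_HTML

# Search step: list each distinct placeholder once, in order of appearance.
my %seen;
my @placeholders = grep { !$seen{$_}++ } $saved_html =~ /(FAKE_\w+)/g;
print "Blanks to replace: @placeholders\n";

# Replace and enclose steps: the document becomes one q^...^ string,
# broken only where a variable is spliced in with ^.$VAR.q^
sub client_letter {
    my ($NAME, $AMOUNT) = @_;
    my $sOutput = q^<html><body>
<p>Dear ^.$NAME.q^,</p>
<p>Our records show an unpaid balance of ^.$AMOUNT.q^.</p>
</body></html>
^;
    return $sOutput;
}

# Single quotes so the dollar sign in the amount is not interpolated.
print client_letter('Jane Doe', '$1,200.00');
```

In a real module the sub would live in its own .pm file and the script would write the returned string to a .htm file for Word to open.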
      I dealt with a similar issue. It was legal documents in WA state with line numbers every 3rd line and horizontal and vertical bars at specific measurements and specific widths. I ran into problems when trying to save those as HTML files. The format for the legal documents was very important, and I had a really hard time making the HTML output correctly. Frankly, I never got it to work right.

      So, I ended up using perl to write text files containing the data the user had typed into fields in a web form. When they hit submit (in Internet Explorer only), perl would write a file and then spit out code to make the browser launch the MS Word document, where a mail merge that I had scripted pulled in the data from that file. It was a hackish solution that wouldn't work on a public web page, but it was fine for this two-person office.
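
A rough sketch of the 'write a data file for the mail merge' half of that approach, under stated assumptions: the field names, the `merge_data` sub, and the file name merge_data.txt are hypothetical, and the layout (a tab-delimited header row followed by one record) is one plausible format for a mail merge data source, not necessarily the one used in the post above. The browser-side code that launches Word is not shown.

```perl
use strict;
use warnings;

# Hypothetical merge field names; they would need to match the
# field names used in the Word document.
my @fields = qw(FirstName LastName Amount);

# Build a header row plus one data row, tab-delimited.
sub merge_data {
    my ($form) = @_;
    my @values = map {
        my $v = defined $form->{$_} ? $form->{$_} : '';
        $v =~ s/[\t\r\n]+/ /g;   # keep stray tabs/newlines from breaking columns
        $v;
    } @fields;
    return join("\t", @fields) . "\n" . join("\t", @values) . "\n";
}

# Stand-in for the values submitted through the web form.
my %form = (FirstName => 'Jane', LastName => 'Doe', Amount => '1,200.00');

my $file = 'merge_data.txt';
open my $fh, '>', $file or die "Cannot write $file: $!";
print {$fh} merge_data(\%form);
close $fh or die "Cannot close $file: $!";
```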

      The other thing I explored was Adobe Acrobat. Adobe has a scripted way to insert data into fields, and a perl interface is already written. If the office is willing to splurge on the cost of Acrobat and convert all its documents to PDFs, you could use PDF Forms (FDF?) and easily script them in perl.

      I should add that MSFT OFFICE HTML also allows you to steer clear of Win32::OLE. Win32::OLE is not a bad option, but it requires a copy of MSFT Word on the machine, and it adds an extra level of abstraction and indirection that may be difficult to debug. It is almost always preferable to simply output text, which can be opened in Word, a web browser, or even a competing office package like OO.