xorl has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to convert some Word .doc files to Adobe .pdf. In the process we're also changing the naming conventions so foo.doc might become bar.pdf as well as storing the new filename and path in a database.

We're not a Windows shop and don't have Word. So all the fancy stuff you might be able to do with the Win32::OLE modules isn't possible.

What I envisioned so far with the script is this:
1) Crawl specific directories
2) Pick out the .doc file
3) Query the DB to figure out the new file name
4) convert the .doc file to .pdf
5) move the .pdf to the correct location
6) delete the old .doc file
7) update the DB with the new info
8) loop back to the start
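
The eight steps above could be sketched roughly as follows. This is only an outline under stated assumptions: `lookup_new_name`, `process_tree`, the `%$names` hash (standing in for the database query in step 3), and the `$convert` and `$record` callbacks (steps 4 and 7) are hypothetical names invented for illustration, and the actual .doc to .pdf conversion is left to whatever tool ends up being used.

```perl
use strict;
use warnings;
use File::Find;
use File::Copy qw(move);
use File::Basename qw(dirname);
use File::Path qw(make_path);

# Hypothetical stand-in for step 3: a real script would query the DB here.
sub lookup_new_name {
    my ($names, $doc) = @_;
    return $names->{$doc};
}

# $convert is called as $convert->($doc, $pdf) and must return true on success.
# $record is called as $record->($doc, $target) after a successful move.
sub process_tree {
    my ($root, $names, $convert, $record) = @_;

    # Steps 1 and 2: collect the .doc files first, so the tree is not
    # modified while File::Find is still walking it.
    my @docs;
    find({
        wanted   => sub { push @docs, $File::Find::name if -f && /\.doc\z/i },
        no_chdir => 1,
    }, $root);

    my $done = 0;
    for my $doc (sort @docs) {                       # step 8: loop
        my $target = lookup_new_name($names, $doc)   # step 3
            or next;
        my $tmp = "$doc.pdf";
        next unless $convert->($doc, $tmp);          # step 4
        make_path(dirname($target));
        move($tmp, $target)                          # step 5
            or die "Cannot move $tmp to $target: $!\n";
        unlink $doc or warn "Cannot delete $doc: $!\n";   # step 6
        $record->($doc, $target);                    # step 7
        $done++;
    }
    return $done;
}
```

Passing the converter in as a code ref keeps the crawl, rename and bookkeeping logic independent of how step 4 is eventually solved.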

With perl, I can do all except 4) convert the .doc file to .pdf. So first thing I did was check cpan.org. I didn't seem to find anything dealing with .doc to .pdf conversion. I found the PDF and PDF::API2 modules (I've used them before and I'm not thrilled to see them again). Everything I could find regarding Word seemed to be related to the Win32::OLE modules.

So any suggestions for converting .doc to .pdf on Linux?

Replies are listed 'Best First'.
Re: Convert .doc to .pdf
by Corion (Patriarch) on Jan 31, 2007 at 17:58 UTC

    The easiest way is to buy MS Office and Windows, install both in a virtual machine and use that. A comparable way is likely to buy Adobe Acrobat or Adobe Distiller to convert your Word documents to PDF. Maybe Adobe doesn't insist on Windows.

    A different way might be to try to automate OpenOffice as it has import filters for Word and export filters for PDF. Unfortunately, OpenOffice is bad to automate unless you like Java and the object model that Java tends to impose.

    The most ugly but in the long term most beneficial approach would be to extract the import and export filters from OpenOffice and turn them into Perl extensions or at least command line programs to en- or decode as you want.

      Unfortunately, OpenOffice is bad to automate unless you like Java and the object model that Java tends to impose.

      You can also use Python to drive OpenOffice (not that that's much better than Java, mind you . . . :). I don't recall where I found the sample code I based what I wrote (a converter which munged SXC XML files (which had been run through Template Toolkit) into Excel XLS files), but the Python page in the OO wiki may get you started.

      Update: Aaah, found the links to more examples: http://udk.openoffice.org/python/python-bridge.html

Re: Convert .doc to .pdf
by BrowserUk (Patriarch) on Jan 31, 2007 at 18:24 UTC
Re: Convert .doc to .pdf
by klekker (Pilgrim) on Jan 31, 2007 at 19:35 UTC
Re: Convert .doc to .pdf
by ww (Archbishop) on Jan 31, 2007 at 18:35 UTC
    Or, perhaps, you might elaborate step 4):
      a) Open the Word doc with OpenOffice (does the job very nicely)
      b) Tell OO to save as .pdf (in some appropriate place)
    continue with step 5)

    UPDATE: Missed Corion's and Fletch's mention/deprecation of this idea, but OO2.x both reads Word .docs reliably and has an option for .pdf output... which might even be worth the pain of using Java to achieve the minimal automation required in 4b and 5 ... if, in fact, there's no public API to interface with Perl.

      ...and believe me, it pains me to say that.
                        <;-)

      I have done this at $work, unfortunately, I am unable to release it outside of my cubicle walls. But I will say that it is based heavily on the code sample found at http://www.codeproject.com/office/PortableOpenOffice.asp, and hooked into an Apache server via a CGI call. Oh, just so it applies to PerlMonks, the CGI wrapper is a Perl script that does some pre and post processing on the file validation testing, meta-data fillin, etc.

      --MidLifeXis

        Very nice.

        I just tried this with OO 2.0.2

        Built the macro, as described in the article you linked to.

        Since I was not interested in a CGI wrapper, the only interesting part of the ASP/C# stuff in the download accompanying the article is how to call it (Win here):

        path_to_OO_executables\swriter.exe macro:///ConversionLibrary.PDFConversion.ConvertWordToPDF(Word.doc,Output.pdf)

        (I used the same names as in the article)

        Assemble that command line dynamically with the file names needed, run it as a background process, and you are done.
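
Assembling that command line from Perl might look something like the sketch below. The macro URL is copied from the command shown above; the `macro_command` helper name is made up for illustration, and the sketch assumes the file names contain no commas or parentheses, since it does no escaping of the macro arguments.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the command as a list, so it can be handed
# to the list form of system() without going through shell quoting.
sub macro_command {
    my ($swriter, $in, $out) = @_;
    my $macro = "macro:///ConversionLibrary.PDFConversion.ConvertWordToPDF($in,$out)";
    return ($swriter, $macro);
}

# Possible usage (not run here, it needs an OpenOffice install with the macro):
#   my @cmd = macro_command('path_to_OO_executables\swriter.exe',
#                           'Word.doc', 'Output.pdf');
#   system(@cmd) == 0 or warn "conversion failed: $?\n";
# On Win32 perls, system(1, @cmd) starts the process without waiting for it
# (see perlport), which is one way to run it in the background.
```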

        Worked fine, except for an "ErrorCodeIOException" occurring at the export call...

        Hm, took me a few minutes to realise that it was not my fault, but a known bug in V2.0.2. ;-/

Re: Convert .doc to .pdf
by dragonchild (Archbishop) on Jan 31, 2007 at 19:08 UTC
    The biggest issue in all of this is parsing the .doc - if you can do that, then you can create a PS file and use ps2pdf to finish it off. So, you really want to be looking for something that parses doc -> ps. And, given that MS has been very very tight with the .doc format, it may not be doable.
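
The second half of that chain is the easy part. A minimal sketch, assuming Ghostscript's ps2pdf is on the PATH and that something else has already produced the .ps file; the `ps2pdf_command` helper name is invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: derive the output name from the .ps name and return
# the ps2pdf command as a list suitable for system().
sub ps2pdf_command {
    my ($ps) = @_;
    (my $pdf = $ps) =~ s/\.ps\z/.pdf/i
        or die "not a .ps file: $ps\n";
    return ('ps2pdf', $ps, $pdf);
}

# Possible usage (not run here, it needs Ghostscript installed):
#   system(ps2pdf_command('report.ps')) == 0
#       or warn "ps2pdf failed: $?\n";
```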

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Convert .doc to .pdf
by glasswalk3r (Friar) on Feb 01, 2007 at 12:42 UTC

    Maybe you could use OpenOffice and its internal macro language to do the conversion job.

    Alceu Rodrigues de Freitas Junior
    ---------------------------------
    "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill
      Uh, rather than write code, how about you tell OpenOffice to write the document to the PDF printer, like:

      /usr/local/OOffice1.1.5/soffice -pt "PDF" somefile.doc

      The "-pt" says to print this document to the specified printer.
Re: Convert .doc to .pdf
by sgt (Deacon) on Jan 31, 2007 at 20:29 UTC

    another possible chain would be .doc -> .html ->.pdf

    hth --stephan