dwhitney has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I am trying to read and parse an M$ Word doc, pulling out the title, text body, and any bold or italic formatting.
I've seen a bunch of references to the Win32::OLE package, which looks great, but the problem is that it seems to rely on some sort of user interaction and/or having the M$ apps installed locally.

Are there any other methods out there to do this, like how Spreadsheet::WriteExcel works with Excel?

Thanks!
Dennis


Replies are listed 'Best First'.
Re: Need help with perl only parsing of M$ word file
by bart (Canon) on Sep 05, 2003 at 01:10 UTC
      Win32::OLE works not only with MS-Office programs, but with any OLE program that supports IDispatch.

      Otherwise a whole bunch of my utility programs would never work.
      I happen to manage Adobe Acrobat, Trados and many other programs using Win32::OLE.

      Courage, the Cowardly Dog

        I think the point bart was making was that Win32::OLE won't help you read an MSWord document if you don't have Word installed.

Re: Need help with perl only parsing of M$ word file
by dwhitney (Beadle) on Sep 04, 2003 at 22:26 UTC
    Update: Found http://wvware.sourceforge.net/
    Checking this out now.
      http://wvware.sourceforge.net/ didn't help... Bummer... I guess I'll write one in my copious free time. Thanks for all of your suggestions and pointers!
Re: Need help with perl only parsing of M$ word file
by Brutha (Friar) on Sep 05, 2003 at 08:18 UTC
    Dennis,

    I scan a directory tree of word files for creation of an index with SWISH-E.

    I use Win32::OLE and have M$ Word installed, but I do not need any interaction and Word does not have to be visible. This ties the solution to Windows. Are you dependent on the Windows platform? None of the tools I found were exactly what I need; many come from the Unix world and depend on those handy GNU libraries, but I am on Windows here.

    Be aware that after extracting the text you might still have lots of control characters that form tables etc. I am not interested in bold or italic text, but I extract the title and other document properties, user-defined properties, and the text.
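
    As a rough illustration of the control-character cleanup mentioned above, here is a minimal sketch. The helper name clean_word_text is hypothetical, and the mapping rests on an assumption about what Word hands back in its text: carriage returns for paragraph marks, \x07 for table cell marks, \x0B for manual line breaks and \x0C for page breaks. Check it against your own documents before relying on it.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy text extracted from a Word document.
# Assumed markers: \r = paragraph, \x07 = table cell mark,
# \x0B = manual line break, \x0C = page break.
sub clean_word_text {
    my ($text) = @_;
    $text =~ s/\r?\x07/\t/g;              # cell marks become tabs
    $text =~ tr/\r\x0B\x0C/\n/;           # paragraph, line and page breaks become newlines
    $text =~ s/[\x00-\x08\x0E-\x1F]//g;   # drop other control characters, keeping \t and \n
    return $text;
}
```

    Table rows are not reconstructed here; cells simply end up tab-separated, which is usually enough for indexing.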

    My solution was straightforward, as with every OLE interaction I have written in Perl: open the application and its macro editor, press F1 to find the functions, record macros, save the VB-Script, then translate and extend it in Perl, which cuts its length in half.
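
    For readers who want a starting point, the following is a minimal sketch of what such a translated macro might look like. It is not Brutha's code. It assumes Windows with Word installed, the file path is hypothetical, and the object model calls (Documents.Open, Content.Text, BuiltInDocumentProperties, Words, Font.Bold/Italic) are the ones a recorded macro would typically show.

```perl
use strict;
use warnings;
use Win32::OLE qw(in);

my $path = 'C:\\docs\\example.doc';    # hypothetical path

# Start Word through OLE; 'Quit' is invoked when the object is destroyed.
my $word = Win32::OLE->new('Word.Application', 'Quit')
    or die "Cannot start Word: " . Win32::OLE->LastError;
$word->{Visible} = 0;                  # no window, no user interaction

my $doc = $word->Documents->Open({ FileName => $path, ReadOnly => 1 })
    or die "Cannot open $path: " . Win32::OLE->LastError;

my $title = $doc->BuiltInDocumentProperties('Title')->{Value};
my $body  = $doc->Content->{Text};
print "Title: ", (defined $title ? $title : ''), "\n";
print "Body length: ", length($body), " characters\n";

# Report words carrying bold or italic formatting.
# Word returns -1 for true and 0 for false; a mixed range gives
# another non-zero value, which also counts as true here.
for my $w (in $doc->Words) {
    my $font = $w->Font;
    next unless $font->{Bold} || $font->{Italic};
    printf "%s%s: %s\n",
        ($font->{Bold}   ? 'B' : '-'),
        ($font->{Italic} ? 'I' : '-'),
        $w->{Text};
}

$doc->Close(0);                        # 0 = wdDoNotSaveChanges
```

    Walking the Words collection is slow on large documents; it is used here only because it maps directly onto the macro recorder's view of the document.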

    If somebody is interested, I could post my code as a starting point.

    regards Brutha

    And it came to pass that in time the Great God Om spake unto Brutha, the Chosen One: "Psst!"
    (Terry Pratchett, Small Gods)

Re: Need help with perl only parsing of M$ word file
by Aragorn (Curate) on Sep 05, 2003 at 11:06 UTC
    It's not Perl-only, but maybe antiword can help you. It's a program which can read Word files and output plain text and PostScript files. The bolding/italic formatting is not preserved, but it does a pretty good job of getting the text out of it.
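
    If antiword is installed and on the PATH, calling it from Perl could look like the sketch below. The helper doc_to_text and its optional command argument are hypothetical; with no extra arguments it assumes that running antiword with just a file name writes plain text to standard output. List-form pipe opens need a reasonably recent perl on Windows.

```perl
use strict;
use warnings;

# Hypothetical helper: run a converter on $path and return its standard output.
# The command defaults to 'antiword'; pass another command list to override it.
sub doc_to_text {
    my ($path, @cmd) = @_;
    @cmd = ('antiword') unless @cmd;
    open(my $fh, '-|', @cmd, $path)
        or die "Cannot run $cmd[0]: $!";
    my $text = do { local $/; <$fh> };    # slurp everything the command prints
    close($fh) or die "$cmd[0] failed (exit status $?)";
    return $text;
}

# Example use (commented out; needs antiword and a real file):
# print doc_to_text('report.doc');
```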

    Arjen

Re: Need help with perl only parsing of M$ word file
by Kanji (Parson) on Sep 04, 2003 at 21:16 UTC