perl_seeker has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone,
I have in hand some Perl code which converts a .doc file to a text file( extracts each para):
#!perl -w # # Name: # ole-word-2-text-2.pl. # # Purpose: # Read an MS Word doc and extract each paragraph. Do this in 2 diff +erent ways. # # Test environment: # MS Word 2000 under WinNT. # # Author: # Ron Savage <ron@savage.net.au> # # Home: # http://savage.net.au/Perl-tutorials.html#tut-17 # # Version # 1.00 16-Apr-2000 # # Licence: # Copyright (c) 2000 Ron Savage. # All Programs in this package are free software; you can redistrib +ute # them and/or modify them under the same terms as Perl itself. # Perl's Artistic License is available at: # See licence.txt. use strict; use Win32::OLE qw(in with); # -------------------------------------------------- my($input_file) = 'C:\My Documents\msword\olewordtotext1\tut-17\ole +-word-demo-4.doc'; $input_file = Win32::GetCwd() . "/$input_file" if $input_file ! +~ /^(\w:)?[\/\\]/; die "File $input_file does not exist" unless -f $input_file; my($word) = Win32::OLE->new('Word.Application', 'Quit') || die +"Couldn't run Word"; my($doc) = $word->Documents->Open($input_file); print "Extract using first method: \n"; my($index) = 0; for my $paragraph (in $doc->Paragraphs) { $index++; # Remove trailing ^M (the paragraph marker) from Range. my($text) = substr($paragraph->Range->Text, 0, -1); print "Paragraph: $index. Text: <$text>\n\n"; } print '-' x 50, "\n"; print "Extract using second method: \n"; my($paraCount) = $doc->Paragraphs->Count; for ($index = 1 ; $index <= $doc->Paragraphs->Count ; ++$index) { my($text) = $doc->Paragraphs($index); print "Paragraph: $index. Text: <", $text->Range->Text, ">\n\n"; } print '=' x 50, "\n"; $doc->{Saved} = 1; $doc->Close; # Success. print "Success \n"; exit(0);

Is it possible to do the reverse? Convert the text file back to a .doc file with all the formatting. Any sample code or help would be appreciated.

Thanx

20030611 Edit by Corion: Added formatting and code tags

Replies are listed 'Best First'.
Re: Win32::OLE for MS-Word
by tachyon (Chancellor) on Jun 11, 2003 at 12:21 UTC

    To convert a word doc to text all you need to do is use Word to do it (these subs are straight out of a Proforma document management sytem I worte a while back):

    sub save_doc_as_text { my ( $infile, $outfile ) = @_; require Win32::OLE; my $word = Win32::OLE->new( 'Word.Application', sub {$_[0]->Quit;} + ); error( "Can't create new instance or Word Reason:$Win32::OLE::Last +Error" ) unless $word; $word->{visible} = 0; my $doc = $word->{Documents}->Open($infile); error( "Can't open $infile, Reason:$Win32::OLE::LastError" ) unles +s $doc; # wdFormatDocument wdFormatText wdFormatHTML $doc->SaveAs( { FileName => $outfile, FileFormat => $wdFormatText +} ); $doc->Close; undef $doc; undef $word; }
    Note $wdFormatText is the standard constant which you can get from Win32::OLE::Const but as this plays havoc with warnings I tend to hard code it. It has a value of 2

    The conversion back to a word doc is simple enough for plain text - just reverse the procedure and use word to open the text doc and save it as a word doc.

    If the object of the exercise is say to do a search and replace on the text you can do it Word native like this ($word is a word object):

    sub word_find_and_replace { my ( $word, $rel_file_path, $tokens_ref ) = @_; # first make a temporary file to do the search and replace on my ( $fh, $temp_name ) = get_tempfile( "$DOC_DIR/system", 'doc' ); close $fh; my $content_ref = read_file( "$DOC_DIR/$rel_file_path" ); create_file( "$DOC_DIR/system/$temp_name", $content_ref, 'overwrit +e ok' ); $word->{visible} = 0; my $doc = $word->{Documents}->Open("$DOC_DIR/system/$temp_name"); my $search_obj = $doc->Content->Find; my $replace_obj = $search_obj->Replacement; for my $token ( keys %$tokens_ref ) { my $find = '<?' . $token . '?>'; my $replace = $tokens_ref->{$token}; # now i know this looks wierd but M$ word (at least 2000) want +s \r # as the para marker not \r\n or even \n if you send \n you ge +t little # binary squares..... oh well that's M$ for you. $replace =~ s/\r\n|\n/\r/g; # this makes it work properly. GO +K $search_obj->{Text} = $find; $replace_obj->{Text} = $replace; $search_obj->Execute({Replace => $wdReplaceAll}); } $doc->Save; $doc->Close; # now get the data out of the modified temp file $content_ref = read_file( "$DOC_DIR/system/$temp_name" ); # remove our unwanted temp files and objects unlink "$DOC_DIR/system/$temp_name"; undef $search_obj; undef $replace_obj; undef $doc; return $content_ref; }

    Note if you are doing long search and replaces there is a 255 char buffer overflow that will crash you system or cause wierdness if you are lucky. If you need to insert over 255 chars you need to do a ciper block chaining approach and insert 200 chars plus a token then replace the token with the next 200 chars etc, etc until you have inserted all the text you want to put in.

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Win32::OLE for MS-Word
by perl_seeker (Scribe) on Jun 12, 2003 at 11:22 UTC
    Hello,
    thanks a lot for your reply
    I just tried out your first section of codei.e. use Word
    to save a .doc file as a .txt file
    Here's the code:
    sub save_doc_as_text { my ( $infile, $outfile ) = @_; print "\n$infile"; print "\n$outfile"; require Win32::OLE; my $word = Win32::OLE->new( 'Word.Application', sub {$_[0]->Quit;}); error( "Can't create new instance or Word Reason:$Win32::OLE::LastErro +r" ) unless $word; $word->{visible} = 0; my $doc = $word->{Documents}->Open($infile); error( "Can't open $infile, Reason:$Win32::OLE::LastError" ) unless $d +oc; # wdFormatDocument wdFormatText wdFormatHTML $doc->SaveAs( { FileName => $outfile, FileFormat => $wdFormatText}); + $doc->Close; undef $doc; undef $word;} $inf="ole-word-demo-3.doc"; $outf="ole-word-demo-3.txt"; save_doc_as_text($inf,$outf);
    The file ole-word-demo-3.doc is located in the folder
    in which I've stored the above perl script, and executed
    it.Well no ole-word-demo-3.txt file gets created in that
    folder at all.What could be the problem?
    Also, I'm not sure what paras we need to pass to the
    second subroutine(find and replace)
    Actually I need to save a .doc file as a text file,then
    tokenize the text file and process the words in it in some
    way, then convert the .txt file to .doc again
    Also(please treat this as an ignorant soul asking a silly
    question) is there any remote possibility that a .doc
    file can be made to open in a Tk text widget?
    Thanx and regards
      The file ole-word-demo-3.doc is located in the folder in which I've stored the above perl script, and executed it.Well no ole-word-demo-3.txt file gets created in that folder at all.What could be the problem?

      When I test this out, I get an error message indicating that the ole-word-demo-3.doc file cannot be found. If I hard code the location of the document as:

      $inf="C:\\some_dir\\ole-word-demo-3.doc"; $outf="C:\\some_dir\\ole-word-demo-3.txt";
      then the program works to open the specified .doc file and creates the text file. As for the rest of your task, I hope some other Monk will have some insight.
        Yes! It works now and "ole-word-demo-3.txt" gets created.
        The Word formatting characters also appear as unreadable
        stuff in the .txt file though, I suppose they can be
        stripped off.
        For doing the reverse I guess we need to
        replace "wdFormatText" by "wdFormatDocument" in this
        statement:
        $doc->SaveAs( { FileName => $outfile, FileFormat =>wdFormatText
        However I'm not familiar with the Word constants, so I'm
        not sure about the rest of the statements:(
        What are the other changes we need to make to this code?
        Thanks in advance.