cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Howdy Monks. I am having a strange problem with Win32::OLE and Word. Running this simple script:
use strict;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';

my $filename = 'C:\temp\test.doc';
my $word = Win32::OLE->new('Word.Application', 'Quit');
my $doc = $word->Documents->Open($filename)
    || die("Unable to open document ", Win32::OLE->LastError());
my $nwords = $doc->Words->Count;
my @wordtext;
my @wordcolor;
my $starttime = time;
for (my $i = 1; $i <= $nwords; $i++) {
    $wordcolor[$i] = $doc->Words->Item($i)->HighlightColorIndex;
    $wordtext[$i]  = $doc->Words->Item($i)->Text;
}
my $elapsed = time - $starttime;
printf "\n\n%6.0D words\n%6.0D sec\n%5.3f per word",
    $nwords, $elapsed, $elapsed/$nwords;
$doc->Close;
undef $word;

on a 30 page paper results in a processing rate of 0.158 sec/word, or about 379 words per minute on a P4 3.4 GHz with many gigs of memory. Now I know OLE is a dog, but that's about human reading speed! Clearly something is wrong.

The stranger thing is that I did some experimenting, and the time per word is a linear function of the number of words in the document. In other words, the reading rate is twice as fast for a document half the size, half as fast for a document twice the size, and so on. This one really baffles me because I can't imagine why the size of the document would influence the reading speed, unless it has to count from the beginning every time you request a word, and surely they didn't design the method like that.

So I checked the archives and found this thread about Word slowness and tried some of the suggested fixes. Disabling the Word macro checker and disabling DEP helped maybe a little but didn't solve the problem.

Anybody know how I can speed this thing up? Can I get the text and color info in larger chunks, maybe?

Many thanks....

Steve

Re: Win32 Word behaving strangely
by vkon (Curate) on Jul 28, 2006 at 20:29 UTC
    The reason for the slowness is your seemingly innocent Item($i).

    In MS Word this construct is not array-like access; each call walks from element 1 up to $i to reach the $i-th element.

    This gives an access time of O(N) instead of O(1), and makes your loop O(N^2), which is bad.

    Plus you use this several times in a loop.
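
    As a rough check of that claim, here is the arithmetic under one assumption that is not confirmed anywhere in this thread: that a single Item(i) call costs about c*i for some constant c, and that the loop makes two such calls per word.

    ```latex
    T(N) = \sum_{i=1}^{N} 2\,c\,i = c\,N(N+1),
    \qquad
    \frac{T(N)}{N} = c\,(N+1)
    ```

    Under that assumption the time per word grows linearly with the number of words N, which would match the observation that a document half the size is read twice as fast.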

    A typical workaround is something like this:

    use Win32::OLE qw(in);
    @items = in $doc->Words->Items;

    Been there, seen that :)

      I knew it had to be something like that. But I tried your fix, a la
      use strict;
      use Win32::OLE qw( in );
      use Win32::OLE::Const 'Microsoft Word';

      my $filename = 'E:\test.doc';
      my $word = Win32::OLE->new('Word.Application', 'Quit');
      my $doc = $word->Documents->Open($filename)
          || die("Unable to open document ", Win32::OLE->LastError());
      my $nwords = $doc->Words->Count;
      my @wordtext;
      my @wordcolor;
      my $starttime = time;
      my @items = in $doc->Words->Items;
      for (my $i = 1; $i <= $nwords; $i++) {
          $wordcolor[$i] = $items[$i]->HighlightColorIndex;
          $wordtext[$i]  = $items[$i]->Text;
      }

      and @items comes back null.

        'in' does not always work well...
        But you can try working at the paragraph level, like so:
        my @paras = in $doc->Paragraphs;# or so

        BTW, even $doc->Words->Count will traverse the entire document content... so be careful!

        I would also suggest slurping the entire document content... but then it becomes difficult to get at a word in the middle... so this may not be good advice...
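
        The paragraph-level and slurp ideas above might look roughly like the following. This is an untested sketch rather than a known fix: it assumes that `in` works on $doc->Paragraphs for this document, and it uses Content, Range, Text and HighlightColorIndex from the Word object model. The file path is the one from the original post. A paragraph with mixed highlighting is not expected to report a single color, so such paragraphs would still need a word-level pass over $range->Words.

        ```perl
        use strict;
        use warnings;
        use Win32::OLE qw(in);

        my $filename = 'C:\temp\test.doc';    # path from the original post

        my $word = Win32::OLE->new('Word.Application', 'Quit')
            or die "Unable to start Word: ", Win32::OLE->LastError();
        my $doc = $word->Documents->Open($filename)
            or die "Unable to open document: ", Win32::OLE->LastError();

        # Slurp: one OLE call returns the text of the whole main story.
        my $alltext = $doc->Content->Text;

        # Paragraph level: one Range per paragraph instead of one per word.
        my @paras;
        for my $para (in $doc->Paragraphs) {
            my $range = $para->Range;
            push @paras, {
                text  => $range->Text,
                color => $range->HighlightColorIndex,
            };
        }

        printf "%d characters, %d paragraphs\n", length($alltext), scalar @paras;

        $doc->Close(0);    # 0 = wdDoNotSaveChanges
        undef $word;
        ```

        Splitting $alltext on whitespace in Perl will not necessarily give the same units as Word's Words collection, so the counts may differ.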

        try 'in $doc->Words;' instead of 'in $doc->Words->Items;'
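
        Applied to the original script, that suggestion might look like the sketch below. It is untested here and assumes that enumerating $doc->Words through `in` walks the collection once rather than restarting from the first word on every access; whether that holds for a given Word version would need measuring. Note that `in` returns an ordinary 0-based Perl list, so the arrays are filled with push instead of being indexed from 1.

        ```perl
        use strict;
        use warnings;
        use Win32::OLE qw(in);

        my $filename = 'C:\temp\test.doc';    # path from the original post

        my $word = Win32::OLE->new('Word.Application', 'Quit')
            or die "Unable to start Word: ", Win32::OLE->LastError();
        my $doc = $word->Documents->Open($filename)
            or die "Unable to open document: ", Win32::OLE->LastError();

        my (@wordtext, @wordcolor);
        my $starttime = time;

        # Enumerate the collection once instead of calling Item($i) per word.
        for my $w (in $doc->Words) {
            push @wordtext,  $w->Text;
            push @wordcolor, $w->HighlightColorIndex;
        }

        my $elapsed = time - $starttime;
        my $nwords  = @wordtext;    # avoids a separate Words->Count call
        printf "%d words, %d sec, %.3f sec per word\n",
            $nwords, $elapsed, $nwords ? $elapsed / $nwords : 0;

        $doc->Close(0);    # 0 = wdDoNotSaveChanges
        undef $word;
        ```

        Because `in` builds the whole list before the loop starts, a very large document will hold one OLE object per word in memory; Win32::OLE::Enum with its Next method is an alternative if that becomes a problem.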
Re: Win32 Word behaving strangely
by holli (Abbot) on Jul 28, 2006 at 19:06 UTC
    Using your code, I measure 0.01 - 0.03 seconds per word on my computer (with a 2 GHz Pentium and 256 MB RAM). That's still dead slow, but roughly 10 times faster than yours. Of course that doesn't help you much. See it as a data point ;)


    holli, /regexed monk/