Clancy has asked for the wisdom of the Perl Monks concerning the following question:

I use Windows 2000 with IIS and ActivePerl. I have to index a site with documents of these types: .doc, .ppt, .xls, and others. I use the Perlfect search engine, but I don't have code for these three types of file. Please help me. Thanks.

Replies are listed 'Best First'.
Re: (Get text of Word Document)going through a Win32 MSWORD doc
by buzzcutbuddha (Chaplain) on Dec 20, 2001 at 22:28 UTC
    If what you meant by indexing is grabbing each word in a Word Document and creating an index of those words to find them later, the following will give you all of the words in a document. I'll let you focus on the indexing part. :)

    #!/usr/bin/perl

    # general use directives
    use strict;
    use warnings;

    # project specific use directives
    # this comes with the standard ActiveState
    # distribution. You can also look for
    # a newer version with PPM
    use Win32::OLE;

    # get the document
    # use the full path
    # GetObject returns undef on failure rather than dying,
    # so test the return value instead of $@
    my $wd = Win32::OLE->GetObject('C:/pathto/document/foo.doc')
        or die "Unable to load document: ", Win32::OLE->LastError(), "\n";

    # all of the Word document data members I'm using
    # are explained in the MSDN documentation of the
    # external interfaces of a Word Document.
    # if you have MSDN, search for "Word OLE".

    # get the number of paragraphs
    my $paraCount = $wd->{Paragraphs}->Count;

    # collection items are numbered from 1
    my @words;
    for my $foo (1 .. $paraCount) {
        # split ' ' skips leading and repeated whitespace,
        # so no empty strings end up in @words
        push @words, split ' ', $wd->Paragraphs->Item($foo)->Range->Text;
    }

    #clean up at the end
    undef $wd;
    That's how you get the words of a Word document out and into an array. You may prefer a different data structure, but again, I'll leave that up to you! I hope this helps.
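    As one possible "different data structure", here is a minimal sketch that turns a list of words into a hash of term counts. The name build_index and the normalisation rules (lower-casing, stripping surrounding punctuation) are assumptions made for illustration, not part of the post above or of Perlfect.

```perl
use strict;
use warnings;

# Build a term => occurrence-count hash from a list of raw words.
# build_index is a hypothetical helper used only for this sketch.
sub build_index {
    my @words = @_;
    my %index;
    for my $word (@words) {
        # Normalise: lower-case, then strip leading/trailing punctuation.
        (my $term = lc $word) =~ s/^\W+|\W+$//g;
        next unless length $term;
        $index{$term}++;
    }
    return \%index;
}

# Example input standing in for the @words array from the Word document.
my $index = build_index(qw(The quick fox. the FOX jumps));
printf "%-6s %d\n", $_, $index->{$_} for sort keys %$index;
```

    A hash keeps lookups cheap when checking whether a term occurs in a document; a real indexer would probably also record which document each term came from.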
Re: going through a Win32 MSWORD doc
by talexb (Chancellor) on Dec 20, 2001 at 22:04 UTC
    Suggestion: Place code inside <code> tags. This makes it easier for us to read. Preview your post until it looks OK.

    After massaging your code a little I was able to see that only some of it pertains to the question at hand. Why not start from scratch? If you have to search for files, use File::Find to get files. Then look at the file extension (because that's how Windows tells files apart) to figure out what to do.

    Unless I've misunderstood what your objective is.
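
    A rough sketch of that approach, assuming a hypothetical table that maps file extensions to handler names. The names %handler_for, handler_for_file and collect_documents are invented for illustration; only File::Find and its find() function come from the standard distribution.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical mapping from file extension to the name of a handler.
my %handler_for = (
    doc => 'word',
    xls => 'excel',
    ppt => 'powerpoint',
    pdf => 'pdf',
);

# Return the handler name for a filename, or undef if it is unsupported.
sub handler_for_file {
    my ($file) = @_;
    my ($ext) = $file =~ m/\.([^.\\\/]+)$/;
    return unless defined $ext;
    return $handler_for{ lc $ext };
}

# Walk $root and return [path, handler] pairs for every supported file.
sub collect_documents {
    my ($root) = @_;
    my @found;
    find(
        sub {
            return unless -f $_;
            my $handler = handler_for_file($_);
            push @found, [ $File::Find::name, $handler ] if defined $handler;
        },
        $root
    );
    return @found;
}

# Usage: pass a directory on the command line to list what would be indexed.
if (@ARGV) {
    print "$_->[1]: $_->[0]\n" for collect_documents( $ARGV[0] );
}
```

    Each handler name would then be dispatched to whatever routine extracts text for that file type.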

    "Excellent. Release the hounds." -- Monty Burns.

Re: going through a Win32 MSWORD doc
by Gerard (Pilgrim) on Dec 20, 2001 at 19:34 UTC
    Hi there,
    I have done a wee bit of work with Perl and MS Word, not a whole lot, but a wee bit, and I would love to help if I can. Unfortunately, I find it difficult to understand exactly what you would like. Perhaps I am just silly. Do you think that you could rephrase your question? Then I will see if I know anything of use.
    Gerard

      This is the part that indexes a PDF file:

      # Checks if a file is PDF depending on the filename. If so, write it to a
      # temporary file and feed it to $PDFTOTEXT, return the output. If it's not
      # PDF, return the buffer unmodified.
      sub parse_pdf {
        my $buffer = $_[0];
        my $url = $_[1];
        if ($url =~ m/\.pdf$/i && $PDFTOTEXT) {
          my $tmpfile = "$TMP_DIR/temp.pdf";
          # Saving to a temporary file is necessary for http requested PDFs. To
          # keep things simpler, we also do it for local files from disk.
          open(TMPFILE, ">$tmpfile") or warn "Cannot write '$tmpfile': $!";
          binmode(TMPFILE);
          print TMPFILE ${$buffer};
          close(TMPFILE);
          # filename security check is done in to_be_ignored():
          ${$buffer} = `$PDFTOTEXT "$tmpfile" -`
            or (warn "Cannot execute '$PDFTOTEXT $tmpfile -': $!" and return undef);
          unlink $tmpfile or warn "Cannot remove '$tmpfile': $!";
        }
      }

      # Save a term's ID to the database, if it does not yet exist. Return the ID.
      sub record_term {
        my $term = $_[0];
        print STDERR "Warning: record_term($term): No term was supplied\n" unless $term;
        if ($terms_db{$term}) {
          return $terms_db{$term};
        } else {
          ++$TN;
          $terms_db{$term} = $TN;
          return $TN;
        }
      }

      # Is the file listed in @no_index or is it a PDF file with illegal characters
      # in the filename?
      # Supported ways to list a file in conf/no_index:
      # /home/www/test/index.html (absolute path)
      # /test/index.html (path relative to webroot, but with slash)
      # test/index.html (path relative to webroot, no slash)
      # http://localhost/test/index.html (absolute URL)
      sub to_be_ignored {
        my $file = shift;
        # Check @no_index:
        my $file_relative;
        $file_relative = cut_document_root($file);
        foreach my $regexp (@no_index) {
          if( $file_relative =~ m/^\/?$regexp$/ || $file =~ m/^$regexp$/ ) {
            return "listed in no_index.txt";
          }
        }
        # For PDF files check filename for security reasons (it later gets handed to a shell!):
        if( $file =~ m/\.pdf$/i && $PDFTOTEXT ) {
          if( $file !~ m/^[\/\\a-zA-Z0-9_.:+-]*$/ || $file =~ m/\.\./ ) {
            return "Ignoring '$file': illegal characters in filename";
          }
        }
        return undef;
      }

      I am looking for something like this, but for .doc, .ppt, and .xls files.
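
      A minimal sketch of what a Word counterpart might look like, assuming Word is installed on the indexing machine and the file is available locally under its full path. The name doc_to_text is hypothetical and not part of Perlfect; the sketch uses Win32::OLE's GetObject and LastError together with the Content, Text and Close members of the Word object model. Excel and PowerPoint files would need their own object models and are not covered here.

```perl
use strict;
use warnings;

# Hypothetical helper: return the plain text of a Word document, or undef
# if $path is not a .doc file or Word cannot open it.
sub doc_to_text {
    my ($path) = @_;
    return unless $path =~ m/\.doc$/i;

    # Loaded lazily so the script still compiles where Win32::OLE is absent.
    require Win32::OLE;

    # GetObject needs the full path and returns undef on failure.
    my $doc = Win32::OLE->GetObject($path);
    unless ($doc) {
        warn "Cannot open '$path': " . Win32::OLE->LastError() . "\n";
        return;
    }

    # Content is the Range covering the main body of the document.
    my $text = $doc->Content->Text;

    $doc->Close(0);    # 0 = wdDoNotSaveChanges
    return $text;
}
```

      The returned text could then be handed to the same term-recording code that the PDF branch feeds.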

      jcwren - 2001/12/20 22:18:00 UTC - added code tags