This is the part that indexes a PDF file:

# Checks if a file is PDF depending on the filename. If so, write it to a
# temporary file and feed it to $PDFTOTEXT, return the output. If it's not
# PDF, return the buffer unmodified.
sub parse_pdf {
  my $buffer = $_[0];
  my $url = $_[1];
  if ($url =~ m/\.pdf$/i && $PDFTOTEXT) {
    my $tmpfile = "$TMP_DIR/temp.pdf";
    # Saving to a temporary file is necessary for http requested PDFs. To
    # keep things simpler, we also do it for local files from disk.
    open(TMPFILE, ">$tmpfile") or warn "Cannot write '$tmpfile': $!";
    binmode(TMPFILE);
    print TMPFILE ${$buffer};
    close(TMPFILE);
    # filename security check is done in to_be_ignored():
    ${$buffer} = `$PDFTOTEXT "$tmpfile" -`
      or (warn "Cannot execute '$PDFTOTEXT $tmpfile -': $!" and return undef);
    unlink $tmpfile or warn "Cannot remove '$tmpfile': $!";
  }
}

# Save a term's ID to the database, if it does not yet exist. Return the ID.
sub record_term {
  my $term = $_[0];
  print STDERR "Warning: record_term($term): No term was supplied\n" unless $term;
  if ($terms_db{$term}) {
    return $terms_db{$term};
  } else {
    ++$TN;
    $terms_db{$term} = $TN;
    return $TN;
  }
}

# Is the file listed in @no_index or is it a PDF file with illegal characters
# in the filename?
# Supported ways to list a file in conf/no_index:
#   /home/www/test/index.html         (absolute path)
#   /test/index.html                  (path relative to webroot, but with slash)
#   test/index.html                   (path relative to webroot, no slash)
#   http://localhost/test/index.html  (absolute URL)
sub to_be_ignored {
  my $file = shift;

  # Check @no_index:
  my $file_relative;
  $file_relative = cut_document_root($file);
  foreach my $regexp (@no_index) {
    if( $file_relative =~ m/^\/?$regexp$/ || $file =~ m/^$regexp$/ ) {
      return "listed in no_index.txt";
    }
  }

  # For PDF files check filename for security reasons (it later gets
  # handed to a shell!):
  if( $file =~ m/\.pdf$/i && $PDFTOTEXT ) {
    if( $file !~ m/^[\/\\a-zA-Z0-9_.:+-]*$/ || $file =~ m/\.\./ ) {
      return "Ignoring '$file': illegal characters in filename";
    }
  }
  return undef;
}

I'm looking for something like this, but for .doc, .ppt and .xls files...
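
For what it's worth, the temp-file-and-convert pattern in parse_pdf does not look PDF-specific, so it could probably be reused for other formats if a command-line converter that writes plain text to stdout is available. Below is a minimal sketch of that idea, not the indexer's actual code. The names is_safe_filename and convert_via_tempfile and the '{}' placeholder convention are hypothetical, and no particular .doc/.ppt/.xls converter is assumed. It uses File::Temp for a unique temp file and the list form of pipe open so that no shell is involved (on Windows this needs a reasonably recent Perl).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: the same whitelist that to_be_ignored() applies to
# PDF filenames, usable for any file that will be handed to a converter.
sub is_safe_filename {
    my ($file) = @_;
    return 0 if $file =~ m/\.\./;
    return $file =~ m/^[\/\\a-zA-Z0-9_.:+-]*$/ ? 1 : 0;
}

# Hypothetical helper: write ${$buffer_ref} to a unique temporary file, run
# the converter on it and return whatever the converter prints to stdout.
# Every '{}' element in @converter is replaced by the temp file name.
sub convert_via_tempfile {
    my ($buffer_ref, $suffix, @converter) = @_;

    my ($fh, $tmpfile) = tempfile(SUFFIX => $suffix, UNLINK => 1);
    binmode($fh);
    print {$fh} ${$buffer_ref};
    close($fh) or do { warn "Cannot write '$tmpfile': $!"; return undef; };

    my @cmd = map { $_ eq '{}' ? $tmpfile : $_ } @converter;

    # List-form pipe open: the arguments go straight to the program,
    # so the filename is never interpreted by a shell.
    open(my $out, '-|', @cmd)
        or do { warn "Cannot execute '$cmd[0]': $!"; return undef; };
    binmode($out);
    local $/;                 # slurp the whole output at once
    my $text = <$out>;
    close($out);

    unlink $tmpfile or warn "Cannot remove '$tmpfile': $!";
    return $text;
}
```

With the variables from the code above, the PDF case would then look something like `${$buffer} = convert_via_tempfile($buffer, '.pdf', $PDFTOTEXT, '{}', '-');`, and another format would only need a different suffix and converter command.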

jcwren - 2001/12/20 22:18:00 UTC - added code tags


In reply to Re: Re: going through a Win32 MSWORD doc by Clancy
in thread going through a Win32 MSWORD doc by Clancy
