Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, can anyone help me with the problem below? I need to open doc, pdf, xls and ppt files and read the number of pages they contain. How can I access the page count (numpages) of Word, Excel and PDF files? Please give me some clue on this. Thank you!
  • Comment on Open a word(.doc) file and output the number of pages contained in it

Re: Open a word(.doc) file and output the number of pages contained in it
by Corion (Patriarch) on Nov 26, 2003 at 11:18 UTC

    This information is, at least for Word, provided by the MS Word Object Model. For the other formats I don't know, especially for .pdf files, since I don't know whether Acrobat Reader provides an OLE interface to the application and its documents. In my opinion the best way is still to print the documents to a printer and then count the pages of the output. That printer need not be a real paper printer; it can be, for example, a .pdf printer, if you know how to get the number of pages in a PDF file.

    For printing any document under Windows, take a look at the shell commands provided in the registry of Explorer and/or at the facilities for automating (Office) applications provided by Win32::OLE. For getting the count of pages out of a PDF, I would look at the (various) PDF modules that CPAN provides.

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
Re: Open a word(.doc) file and output the number of pages contained in it
by wine (Scribe) on Nov 26, 2003 at 13:05 UTC
    For PDF files the following one-liner might help you. I didn't do any decent rearranging of the code:
    perl -0777 -ne 'while (/(<<[^<]*\/Type\s*\/Pages[^>]*>>)/g) { my $dict = $1; $pages = $1 if $dict =~ /\/Count\s*(\d+)/ && $1 > $pages } print "$pages\n"' somefile.pdf

    I tested it on some PDF files and it seems to work. It is not a very elegant solution and might be heavy for large files, since it reads the whole file into memory. It picks out sections like the following from the PDF file and assumes that the largest /Count holds the number of pages, which really is just an assumption.

    23 0 obj
    <<
    /Type /Pages
    /Resources 76 0 R
    /MediaBox [ 0 0 595 842 ]
    /Kids [ 24 0 R // and so on ]
    /Count 11
    >>
    endobj
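
The same heuristic can be written out as a small script, which is easier to test than the one-liner. This is only a sketch under the assumption stated above (the largest /Count in any /Type /Pages dictionary is the page count). The function names are made up for illustration. It will find nothing in PDFs whose page tree is stored inside compressed object streams, and it does not handle dictionaries that contain nested << >> or hex strings.

```perl
use strict;
use warnings;

# Hypothetical helper: return the largest /Count found in any
# "/Type /Pages" dictionary of the raw PDF data, or undef if none is found.
sub pdf_page_count_guess {
    my ($data) = @_;
    my $pages;
    while ($data =~ /(<<[^<]*\/Type\s*\/Pages\b[^>]*>>)/g) {
        my $dict = $1;
        next unless $dict =~ /\/Count\s*(\d+)/;
        $pages = $1 if !defined $pages || $1 > $pages;
    }
    return $pages;
}

# Hypothetical helper: slurp a file in binary mode and apply the heuristic.
sub pdf_file_page_count_guess {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                      # slurp mode: read the whole file at once
    my $data = <$fh>;
    close $fh;
    return pdf_page_count_guess($data);
}

# Command-line use: perl pdfpages.pl one.pdf two.pdf
for my $file (@ARGV) {
    my $n = pdf_file_page_count_guess($file);
    print "$file: ", (defined $n ? $n : 'unknown'), "\n";
}
```

Returning undef instead of 0 when no page tree is found lets the caller tell "could not determine" apart from a real count.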

    - wine

Re: Open a word(.doc) file and output the number of pages contained in it
by Paulster2 (Priest) on Nov 26, 2003 at 11:26 UTC

    I don't know exactly, but another node said that the PDF format has its basis in PostScript. You may want to look in that direction for information on getting the page count for that format. Hope that helps a little.
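
If the document ends up as a PostScript file (for example by printing it to a PostScript printer driver), and the file follows Adobe's Document Structuring Conventions, the page count is available from its `%%Pages:` and `%%Page:` comments. The sketch below assumes such a DSC-conforming file; the function name is made up for illustration, and files with embedded documents that carry their own `%%Pages:` comments may give a wrong answer.

```perl
use strict;
use warnings;

# Hypothetical helper: count pages in DSC-conforming PostScript text.
# Prefers the "%%Pages: n" comment; falls back to counting "%%Page:" lines.
sub ps_page_count {
    my ($ps) = @_;

    # "%%Pages: (atend)" in the header defers the number to the trailer,
    # so use the last %%Pages: line that actually carries a number.
    my @declared = $ps =~ /^%%Pages:[ \t]*(\d+)/mg;
    return $declared[-1] if @declared;

    # No usable %%Pages: comment: count the per-page %%Page: markers.
    my $count = () = $ps =~ /^%%Page:/mg;
    return $count;
}

# Command-line use: perl pspages.pl file.ps
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    local $/;                      # slurp mode: read the whole file at once
    my $ps = <$fh>;
    close $fh;
    print "$file: ", ps_page_count($ps), "\n";
}
```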

    Paulster2

Re: Open a word(.doc) file and output the number of pages contained in it
by guha (Priest) on Nov 26, 2003 at 14:49 UTC

    If you're on Windows and have Word installed, you can use Win32::OLE as Corion indicated above.

    Below is a working example:

    #!perl -w
    use strict;
    use Win32::OLE;
    use Win32::OLE::Const;
    use Win32::OLE::Variant;

    ## Start Word engine ('Quit' is called when $Word goes out of scope)
    my $Word = Win32::OLE->new('Word.Application', 'Quit')
        || die "Cannot start Word: " . Win32::OLE->LastError();
    $Word->{'DisplayAlerts'} = 0;

    ## Load Word's named constants (wdPropertyPages, wdDoNotSaveChanges, ...)
    my $wdc = Win32::OLE::Const->Load("Microsoft Word");

    my $tpl = "c:\\perltest\\pages.doc";    ## Subject to change

    ## Open template
    my $doc = $Word->Documents->Open($tpl, { ReadOnly => Variant(VT_BOOL, 1) } )
        || die "Cannot open $tpl: " . Win32::OLE->LastError();

    ## Read the built-in "number of pages" document property
    my $pages = $Word->ActiveDocument->BuiltInDocumentProperties(
        $wdc->{wdPropertyPages} )->Value;

    $doc->Close( { SaveChanges => $wdc->{wdDoNotSaveChanges} } );

    print " $tpl contains $pages pages\n";
    HTH