Count Colour Pages in PDF

sureshrps has asked for the wisdom of the Perl Monks concerning the following question:

Re: Count Colour Pages in PDF by Corion (Patriarch) on Nov 09, 2009 at 09:52 UTC
What do you mean by "without opening the pdf"? You will need to use the open function to read the file. If you just don't want to "open a window for the user", that's likely possible. A cursory glance through PDF::API2 doesn't turn up anything ready-made, so you may have to inspect all page objects and check whether each one is an image ("likely" colored) or uses a non-grayscale color.
Re: Count Colour Pages in PDF by marto (Cardinal) on Nov 09, 2009 at 09:56 UTC
I seriously doubt you will find a way to examine the contents of a PDF file without actually opening the PDF in some manner. That isn't going to happen. Are you counting pages which contain colour text/images, or do these PDF files consist of one image per page? Perhaps PDF::API2's colourspace methods may be of use. Martin
Re: Count Colour Pages in PDF by almut (Canon) on Nov 09, 2009 at 17:40 UTC
I presume that by "without open" you mean "without opening the file in a PDF viewer and eyeballing its page contents" (as others have pointed out, you'd of course have to open/read the file somehow in order to analyse its contents).

That said, I'm not aware of an easy way to solve the task with any of the existing PDF modules on CPAN. Even though there may be ways of identifying the color space being used, this doesn't necessarily help to detect whether the page is using color. For example, it's rather common to draw nothing but black in the RGB color space... In other words, you'd have to check the effective color of every single PDF drawing/imaging instruction, which would be pretty cumbersome.

Personally, I would approach the problem as follows:

- Convert all pages to raster image format
- For every image:
  - convert to grayscale
  - convert back to color (RGB)
  - check if there's any difference between the original image and the double-converted image

the idea being that there is no difference if the original image/page didn't have any color in the first place.

As usual, there are several ways to implement this. One way would be to use Ghostscript and a few of the good old Netpbm tools:

```sh
#!/bin/sh

infile=$1
prefix=tmp-page

rm -f $prefix*p?m   # clean up leftovers from a previous run

# convert pages to raster images (one PPM file per page, at 30 dpi)
gs -sDEVICE=ppmraw -r30 -sOutputFile=$prefix%03d.ppm -dNOPAUSE -dBATCH -q "$infile"

for img in $prefix*.ppm ; do                   # for each page
    ppmtopgm $img > $img.pgm                   # convert to grayscale
    pgmtoppm '#fff' $img.pgm > $img.pgm.ppm    # convert back to RGB
    pnmpsnr $img $img.pgm.ppm                  # diff
done
```

With the following sample input¹ consisting of three pages (two pages in black&white/gray, one page in color)

```postscript
%!PS

/Helvetica findfont 50 scalefont setfont
/text (PerlMonks rocks!) def

% page 1 - black
100 500 moveto text show
100 400 moveto text show
showpage

% page 2 - gray
0.5 setgray
100 500 moveto text show
100 400 moveto text show
showpage

% page 3 - color (black and red)
0 setgray
100 500 moveto text show
1 0 0 setrgbcolor   % red
100 400 moveto text show
showpage
```

you'd get this output (from the diff-ing tool, `pnmpsnr`):

```
pnmpsnr: PSNR between tmp-page001.ppm and tmp-page001.ppm.pgm.ppm:
pnmpsnr: Y color component doesn't differ.
pnmpsnr: Cb color component doesn't differ.
pnmpsnr: Cr color component doesn't differ.
pnmpsnr: PSNR between tmp-page002.ppm and tmp-page002.ppm.pgm.ppm:
pnmpsnr: Y color component doesn't differ.
pnmpsnr: Cb color component doesn't differ.
pnmpsnr: Cr color component doesn't differ.
pnmpsnr: PSNR between tmp-page003.ppm and tmp-page003.ppm.pgm.ppm:
pnmpsnr: Y color component: 81.71 dB
pnmpsnr: Cb color component: 35.86 dB
pnmpsnr: Cr color component: 26.43 dB
```

"doesn't differ" in both the luminance and color components (see YCbCr) in this case means that there was no color on the respective page.

I'll leave it as an exercise for the reader to write a little Perl wrapper that parses this output (and, optionally, rewrite the above shell script in Perl).

___

¹ I'm using PostScript input here (for brevity). PDF should work, too, of course (if you don't believe me, you can convert the sample input to PDF using the `ps2pdf` tool that comes with gs :)
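For readers who want a starting point for that exercise, here is a minimal sketch of the parsing step, written in sh/awk to match the script above rather than in Perl. It is not part of almut's reply. The function name `count_colour_pages` is made up for illustration, and the sketch assumes the report format is exactly the one shown in the sample output: one "PSNR between" header per page, followed by one line per component. A page is counted as colour when its Cb or Cr line reports a dB value instead of "doesn't differ".

```sh
#!/bin/sh

# count_colour_pages (hypothetical helper): reads pnmpsnr reports on stdin
# and prints the number of pages whose Cb or Cr component differs.
# Assumes the line format shown in the sample output above.
count_colour_pages() {
    awk '
        # every "PSNR between" header starts the report for a new page
        /PSNR between/ { page++; next }

        # a chroma line that is not "doesn.t differ" means the page has colour
        # ("." stands in for the apostrophe inside this single-quoted program)
        /C[br] color component/ && !/doesn.t differ/ {
            if (!(page in seen)) { seen[page] = 1; n++ }
        }

        END { print n + 0 }   # print 0 rather than an empty line
    '
}

# Possible usage, assuming the script above was saved as pagecheck.sh
# (a made-up name) and that pnmpsnr writes its report to stderr:
#
#   sh pagecheck.sh input.pdf 2>&1 | count_colour_pages
```

Fed the sample report above, this prints 1. Tracking pages in the `seen` array keeps a page from being counted twice when both Cb and Cr differ.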