binarybits has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks,

This is my first Perlmonks question. Hope I'm in the right place.

I've got a large corpus of documents and I'm trying to write a script that will find all the documents containing black redaction rectangles. Example:

http://www.limathreefive.com/pdf/GBA.pdf

I've spent a couple of days playing around with CAM::PDF, and I've managed to detect some rectangles by using CAM::PDF's parsing code (with CAM::PDF::Renderer::Text as a model) and then looking for and interpreting the relevant PDF operators (re, m, l, f, etc.). This works OK, but it doesn't detect all the relevant rectangles (like the ones in the file above), and it involves writing a lot of low-level code. I'm sure there are subtleties in the PDF spec I'm not taking into account, and that these will cause me to miss some files.

So: are there other libraries I should be using? I've looked at the PDF::API2 documentation and it doesn't appear to handle this sort of thing any better than CAM::PDF does. Or does CAM::PDF have other helpful (ideally higher-level) functions I ought to be using?

Thanks!

-Tim

Replies are listed 'Best First'.
Re: Finding Rectangles in PDFs
by almut (Canon) on Jan 12, 2010 at 22:48 UTC

    I'm not aware of any canned module for extracting black rectangles from PDFs.  I could be wrong of course, but I'm afraid you'll have to do some low-level coding yourself, similar in spirit to what you've already tried, i.e. searching the content streams for re commands, or certain combinations of line+fill operators.

    Personally, when it comes to low-level messing with PDFs, I'm a big fan of pdftk, as it allows easy uncompressing of the PDF's content streams. For example, doing the following

    $ pdftk GBA.pdf output - uncompress | grep ' re$' | wc
       4730   23650  132978

    counts 4730 rectangle drawing instructions (though they may of course not all qualify as redaction rectangles...)
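
    To narrow that count down to rectangles that are actually filled with black, something along the following lines might serve as a starting point. This is only a rough sketch under several assumptions, not a tested redaction detector: the function name count_black_rects is made up for illustration, it only understands the g, rg and k fill colour operators (sc/scn and colour spaces are ignored), it does not parse string literals, it ignores rectangle size and position, and any binary streams in the file (fonts, images) may produce spurious hits. It counts both re rectangles and simple four-point m/l paths, provided they are painted with a fill operator while the fill colour is black.

```shell
# Count black-filled rectangles in an uncompressed PDF read from stdin.
# Heuristic sketch: tokens are split on whitespace only.
count_black_rects() {
    LC_ALL=C awk '
        function reset_path()  { rects = 0; np = 0; irregular = 0 }
        # A new content stream starts with the default fill colour (black).
        function reset_state() { black = 1; depth = 0; reset_path() }
        # Returns 1 if the collected m/l points form one axis-aligned rectangle.
        function line_rect() {
            if (irregular) return 0
            if (np == 5 && px[5] == px[1] && py[5] == py[1]) np = 4
            if (np != 4 || px[1] == px[3] || py[1] == py[3]) return 0
            return (px[1] == px[2] && py[2] == py[3] && px[3] == px[4] && py[4] == py[1]) ||
                   (py[1] == py[2] && px[2] == px[3] && py[3] == py[4] && px[4] == px[1])
        }
        BEGIN { reset_state(); n = 0; count = 0 }
        {
            for (i = 1; i <= NF; i++) {
                t = $i
                # Numeric operands are pushed until the next operator arrives.
                if (t ~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)$/) { stack[++n] = t + 0; continue }
                if (t == "g")       black = (n >= 1 && stack[n] == 0)
                else if (t == "rg") black = (n >= 3 && stack[n] == 0 && stack[n-1] == 0 && stack[n-2] == 0)
                else if (t == "k")  black = (n >= 4 && stack[n] == 1)
                else if (t == "q")  saved[++depth] = black
                else if (t == "Q")  { if (depth > 0) black = saved[depth--] }
                else if (t == "re") rects++
                else if (t == "m" || t == "l") {
                    # A second moveto means several subpaths: give up on this path.
                    if (n < 2 || (t == "m" && np > 0)) irregular = 1
                    else { np++; px[np] = stack[n-1]; py[np] = stack[n] }
                }
                else if (t == "c" || t == "v" || t == "y") irregular = 1
                else if (t == "f" || t == "F" || t == "f*" || t == "B" || t == "B*" || t == "b" || t == "b*") {
                    if (black) count += rects + line_rect()
                    reset_path()
                }
                else if (t == "n" || t == "S" || t == "s") reset_path()
                else if (t == "stream" || t == "endstream") reset_state()
                n = 0
            }
        }
        END { print count }
    '
}
```

    It would be used in place of the grep, e.g. pdftk GBA.pdf output - uncompress | count_black_rects. Rectangles used only for clipping (re W n) are dropped because n discards the path without painting it.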

    Anyhow, what's the idea behind extracting those rectangles, i.e. what are you really trying to do? Maybe there is some entirely different approach to solving your problem.

      Forgive me for being a pedantic jerk, I really only say this for a fuller understanding of Unix tools. But if all you care about is a count of the number of matches that grep finds, you can simply use the -c flag of grep instead of piping the output to wc.

      $ pdftk GBA.pdf output - uncompress | grep -c ' re$'

      See also: UUOC "wc -l" section.

      Thanks for suggesting pdftk! That looks super helpful.

      I'm doing research on privacy in judicial documents as part of the RECAP project. I've got a corpus of ~1 million US federal court documents, and I'm trying to categorize them as redacted or unredacted for use in subsequent analysis. The long-run goal is to develop tools for computer-assisted document redaction.

      One of the challenges is that the documents were created by many different parties using different software and techniques, so I'll probably have to look for a number of different patterns to find all (or at least most) of the redactions. That's why I want the techniques I use to be as general as possible.

      Thanks again for your help!