in reply to Finding Rectangles in PDFs

I'm not aware of any canned module for extracting black rectangles from PDFs.  I could be wrong of course, but I'm afraid you'll have to do some low-level coding yourself, similar in spirit to what you've already tried, i.e. searching the content streams for re commands, or certain combinations of line+fill operators.
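To make the "line+fill operators" idea concrete, here is a minimal sketch, assuming the content stream has already been decompressed into plain text. It looks for paths built from one `m` and three or four `l` operators that trace an axis-aligned box and are then filled. The function names are hypothetical, and the whitespace tokenizer is a deliberate simplification: it does not understand string literals or inline images, and it ignores fill color entirely.

```python
FILL_OPS = {"f", "F", "f*", "B", "B*", "b", "b*"}   # path-painting operators that fill
RESET_OPS = {"S", "s", "n", "re", "c", "v", "y"}    # anything else that ends or complicates the path


def _as_rect(points):
    """Return (x, y, w, h) if the points trace an axis-aligned rectangle, else None."""
    if len(points) == 5 and points[0] == points[-1]:
        points = points[:-1]                 # path closed explicitly with a final `l`
    if len(points) != 4 or len(set(points)) != 4:
        return None
    xs = {p[0] for p in points}
    ys = {p[1] for p in points}
    if len(xs) != 2 or len(ys) != 2:
        return None
    # Each edge must change exactly one coordinate (rules out "bow tie" paths).
    for a, b in zip(points, points[1:] + points[:1]):
        if (a[0] == b[0]) == (a[1] == b[1]):
            return None
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


def rects_from_line_paths(content):
    """Scan decompressed content-stream text for filled m/l/l/l rectangles."""
    operands, points, found = [], [], []
    for tok in content.split():
        try:
            operands.append(float(tok))      # numeric operand, keep for the next operator
            continue
        except ValueError:
            pass
        if tok == "m" and len(operands) >= 2:
            points = [tuple(operands[-2:])]  # start a new subpath
        elif tok == "l" and len(operands) >= 2 and points:
            points.append(tuple(operands[-2:]))
        elif tok in FILL_OPS:
            box = _as_rect(points)
            if box:
                found.append(box)
            points = []
        elif tok in RESET_OPS:
            points = []
        operands = []                        # operands never carry past an operator
    return found


print(rects_from_line_paths("10 10 m 60 10 l 60 30 l 10 30 l h f"))
# prints [(10.0, 10.0, 50.0, 20.0)]
```

A real corpus will likely need more patterns than this (thick stroked lines, images used as masks, annotations), so treat it as one detector among several.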

Personally, when it comes to low-level messing with PDFs, I'm a big fan of pdftk, as it allows easy uncompressing of the PDF's content streams. For example, doing the following

$ pdftk GBA.pdf output - uncompress | grep ' re$' | wc
   4730   23650  132978

counts 4730 rectangle drawing instructions (though they may of course not all qualify as redaction rectangles...)
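As a rough sketch of how that count could be narrowed down, the following assumes you have the text of an uncompressed content stream and keeps only `re` rectangles that are actually filled while the fill color is black. The function name is hypothetical, and there are several simplifying assumptions: tokens are split on whitespace (so string literals and inline images can confuse it), only the `g`, `rg` and `k` color operators are tracked (not `cs`/`sc`/`scn`), and any CMYK color with K = 1 is treated as black.

```python
FILL_OPS = {"f", "F", "f*", "B", "B*", "b", "b*"}   # path-painting operators that fill
END_PATH_OPS = {"S", "s", "n"}                      # stroke-only or no-op path endings


def black_filled_rects(content):
    """Return (x, y, w, h) for each `re` rectangle filled while the fill color is black."""
    fill_black = True    # the initial nonstroking color in PDF is black
    stack = []           # fill state saved by q, restored by Q
    operands = []        # numeric operands seen since the last operator
    pending = []         # rectangles in the current, not yet painted, path
    found = []
    for tok in content.split():
        try:
            operands.append(float(tok))
            continue
        except ValueError:
            pass
        if tok == "re" and len(operands) >= 4:
            pending.append(tuple(operands[-4:]))
        elif tok == "g" and operands:
            fill_black = operands[-1] == 0                    # gray level 0 is black
        elif tok == "rg" and len(operands) >= 3:
            fill_black = all(v == 0 for v in operands[-3:])   # RGB 0 0 0
        elif tok == "k" and len(operands) >= 4:
            fill_black = operands[-1] == 1                    # heuristic: K = 1
        elif tok == "q":
            stack.append(fill_black)
        elif tok == "Q" and stack:
            fill_black = stack.pop()
        elif tok in FILL_OPS:
            if fill_black:
                found.extend(pending)
            pending = []
        elif tok in END_PATH_OPS:
            pending = []
        operands = []
    return found


sample = "q 1 0 0 rg 10 10 50 20 re f Q 0 g 72 700 200 12 re f 1 g 0 0 10 10 re f"
print(black_filled_rects(sample))
# prints [(72.0, 700.0, 200.0, 12.0)]
```

Even a black filled rectangle is not necessarily a redaction (table rules and page borders look the same at this level), so the result is better read as a list of candidates.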

Anyhow, what's the idea behind extracting those rectangles, i.e. what are you really trying to do? Maybe there is some entirely different approach to solving your problem.

Re^2: Finding Rectangles in PDFs
by jffry (Hermit) on Jan 13, 2010 at 14:50 UTC

    Forgive me for being a pedantic jerk; I really only say this for a fuller understanding of Unix tools. If all you care about is how many lines grep matches, you can use grep's -c flag instead of piping the output to wc.

    $ pdftk GBA.pdf output - uncompress | grep -c ' re$'

    See also: UUOC "wc -l" section.

Re^2: Finding Rectangles in PDFs
by Anonymous Monk on Jan 13, 2010 at 16:50 UTC

    Thanks for suggesting pdftk! That looks super helpful.

    I'm doing research on privacy in court documents as part of the RECAP project. I've got a corpus of ~1 million US federal court documents, and I'm trying to categorize them as redacted or unredacted for use in subsequent analysis. The long-run goal is to develop tools for computer-assisted document redaction.

    One of the challenges is that the documents were created by many different parties using different software and techniques, so I'm probably going to have to look for a number of different patterns to find all (or at least most) of them. So I want the techniques I use to be as general as possible to make sure I catch as many documents as possible.

    Thanks again for your help!