Thanks for suggesting pdftk! That looks super helpful.

I'm doing research on privacy in court documents as part of the RECAP project. I've got a corpus of ~1 million US federal court documents, and I'm trying to classify each one as redacted or unredacted for use in subsequent analysis. The long-run goal is to develop tools for computer-assisted document redaction.

One of the challenges is that the documents were created by many different parties using different software and techniques, so I'm probably going to have to look for a number of different patterns to find all (or at least most) of the redacted ones. That's why I want the techniques I use to be as general as possible.

Thanks again for your help!