Ninth Prince has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

Investment advisers have to file a particular form with the Securities and Exchange Commission. The form is a PDF file with a series of questions that are answered with yes/no checkboxes. An example of a filing can be found at http://www.adviserinfo.sec.gov/Iapd/Content/Common/crd_iapd_Brochure.aspx?BRCHR_VRSN_ID=17814.

What I would like to be able to do is record advisers' answers to various questions on the form. Is there a good way to do this (a module, perhaps?) using Perl?

Thanks.

Re: Reading answers to questions in pdf file
by Fletch (Bishop) on Oct 28, 2008 at 19:25 UTC

    As a general answer, depending on the PDF, you might get acceptable results from running the files through the pdftotext utility that comes with xpdf and then processing the resulting text. Your mileage will vary, though, depending on the PDF and on how cleanly the text happens to come out.
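
    A minimal sketch of that general approach, for PDFs where the answers do survive the conversion. The sub names (pdf_to_lines, extract_answers) and the line pattern are assumptions made up for illustration: the regex guesses that a question ending in "?" and its Yes/No answer land on the same line, which has to be checked against the real pdftotext output and adjusted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull "question? Yes/No" pairs out of lines of extracted text.
# The pattern is a guess at the layout (question ending in "?" followed
# by Yes or No on the same line); adjust it to the real output.
sub extract_answers {
    my @lines = @_;
    my %answer;
    for my $line (@lines) {
        if ($line =~ /^\s*(.+?\?)\s+(Yes|No)\s*$/i) {
            $answer{$1} = ucfirst lc $2;    # normalise to "Yes" / "No"
        }
    }
    return \%answer;
}

# Run pdftotext and return its output as a list of chomped lines.
# -layout tries to preserve the physical page layout; "-" as the
# output file sends the text to stdout.
sub pdf_to_lines {
    my ($pdf) = @_;
    open my $fh, '-|', 'pdftotext', '-layout', $pdf, '-'
        or die "Cannot run pdftotext: $!";
    my @lines = <$fh>;
    close $fh or die "pdftotext failed on $pdf (exit status $?)";
    chomp @lines;
    return @lines;
}

# Only touch pdftotext when a file name is given on the command line.
if (@ARGV) {
    my $answers = extract_answers(pdf_to_lines($ARGV[0]));
    print "$_ => $answers->{$_}\n" for sort keys %$answers;
}
```

    Saved under a name of your choosing and run with the PDF's path as its only argument, it prints one "question => answer" line per match. Keeping the parsing in its own sub means the pattern can be tried against hand-typed sample lines before pdftotext is involved at all.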

    In this particular case, though, the checks and values unfortunately don't seem to come through from your sample document, so you're probably going to have to look elsewhere.

    The cake is a lie.

Re: Reading answers to questions in pdf file
by Anonymous Monk on Oct 28, 2008 at 22:11 UTC
    I tried parsing that file with CAM::PDF, with no luck. If you save it as text from Acrobat Reader, the values do appear, but in the text file they are nowhere near the questions they belong to, so matching them up is almost impossible. The SEC loves these PDF files. They post a 13-F securities file that is over 400 pages long and has the same problems you are dealing with: a boatload of embedded tables makes it impossible to parse. I guess if enough people complained, they would make a text version. Who knows.