an_ordinary_man has asked for the wisdom of the Perl Monks concerning the following question:

Hi All.
I want to extract information given in annotations (comments) inside PDF files, using regex.
When I open the PDF file in EditPlus, it shows a large file of about 10,000 lines containing the data I need, along with lots of text that looks like junk. But when I read it through Perl, it reads only about 29 lines, and much of the junk from the PDF is missing.
I tried the texttopdf modules available, but they do not give me the contents of the annotations.
    open (PDFTYPELOG, "+>pdftype.txt") or die "cannot open file errorlog.txt for writing : $!";
    open (FILE_HANDLE, "D:\\MyScripts\\Social008-017.pdf") or die "cannot Social008-017.pdf for reading : $!";
    while ($LineContent = <FILE_HANDLE>) {
        print PDFTYPELOG ($LineContent);
    }
Please tell me how I can read the whole file line by line (if possible).
Regards.
An_Ordinary_Man

Replies are listed 'Best First'.
Re: Read PDF files & do regex through Perl.
by rjray (Chaplain) on Feb 11, 2002 at 23:47 UTC

    A PDF file is not a plain text file. It is a fairly complex binary format, so reading it with normal line-oriented I/O will not work.

    Look into the PDF-oriented modules on CPAN (http://search.cpan.org/search?mode=module&query=PDF), or look for PDF tools on Freshmeat.net. You could use one of these to pre-process the PDF and extract the parts you want, which your Perl script can then handle.

    --rjray

Re: Read PDF files & do regex through Perl.
by beebware (Pilgrim) on Feb 11, 2002 at 23:48 UTC

    You may find the official Adobe PDF Reference books handy - they are available for free download (weighing in at 9 MB) from Adobe's website. Sanface have an early development version of a PDF-lib PDF comment extractor which may help guide you in the right direction.

    You may also find it useful to see how the code in the programs/scripts referenced from this article works, and see if you can 'tweak' it to extract the data instead of placing it in.

    I personally have had experience of a commercial ($40k) package which converted the PDF to raw XML data, but IIRC - even that didn't cope with the annotations in the file. I think the aforementioned PDF specification is your best bet.
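
As a rough illustration of what the PDF specification describes: an annotation is a dictionary with /Type /Annot, and its comment text is stored in the /Contents entry, often as a literal string in parentheses. The sketch below pulls those strings out of raw PDF bytes with a regex. The sub name annotation_contents and the sample object are made up for the demonstration. It assumes the annotation dictionaries are stored uncompressed and unencrypted, and it does not handle hex strings, UTF-16 text, or unescaped nested parentheses, so treat it as a starting point rather than a full parser.

```perl
use strict;
use warnings;

# Return the literal-string value of every "/Contents (...)" entry in $bytes.
# A page's "/Contents 5 0 R" stream reference is skipped because the pattern
# requires an opening parenthesis right after the key.
sub annotation_contents {
    my ($bytes) = @_;
    my @found;
    while ($bytes =~ m{/Contents\s*\(((?:[^()\\]|\\.)*)\)}gs) {
        my $text = $1;
        $text =~ s/\\([()\\])/$1/g;    # undo the escapes for ( ) and \
        push @found, $text;
    }
    return @found;
}

# Hypothetical annotation object, written the way it might appear in a file.
my $sample = join "\n",
    '12 0 obj',
    '<< /Type /Annot /Subtype /Text /Rect [100 100 120 120]',
    '   /Contents (Check this figure \(page 9\)) >>',
    'endobj',
    '';

print "$_\n" for annotation_contents($sample);
```

On a real file, the bytes would have to be read with binmode set on the handle before being passed to the sub.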

Re: Read PDF files & do regex through Perl.
by YuckFoo (Abbot) on Feb 12, 2002 at 02:53 UTC
    When reading binary files on Windows, you need to set 'binmode' immediately after opening the file. You are reading (and processing) a false 'end-of-file' after 29 lines.

    For more information, see 'perldoc -f binmode'.

    Then you can figure out how to process the binary data.

    YuckFoo
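
YuckFoo's binmode suggestion can be sketched as follows. The sub name read_binary and the throwaway sample file are assumptions made for the demonstration, not part of the original script. The sample deliberately contains a Ctrl-Z byte (0x1A), which Windows text mode can treat as end-of-file, along with CRLF line endings that text mode would translate.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file as raw bytes. binmode keeps text-mode handling
# (CRLF translation, Ctrl-Z as end-of-file) out of the way.
sub read_binary {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "cannot open $path for reading: $!";
    binmode($fh);
    local $/;    # slurp mode: read everything in one go
    my $bytes = <$fh>;
    close($fh);
    return $bytes;
}

# Demonstration with a temporary file that contains a Ctrl-Z byte.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh);
my $sample = "%PDF-1.3\r\nbefore\x1Aafter\r\n%%EOF\r\n";
print {$tmp_fh} $sample;
close($tmp_fh);

my $bytes = read_binary($tmp_path);
printf "wrote %d bytes, read %d bytes\n", length($sample), length($bytes);
```

In the original script, the equivalent change would be a binmode(FILE_HANDLE) call right after the open, and a binmode on the output handle as well if the bytes are to be copied unchanged.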

Re: Read PDF files & do regex through Perl.
by aj (Initiate) on Feb 12, 2002 at 10:24 UTC
    Here is a place to get PDF files converted to HTML and/or text. Just attach the required files and email them to all 3 places. You will get back a .txt file and an .html file. Please note: "locked" files cannot be opened unless you use a "special" program. Write me at moic@mail.com and I can give you the details.

    Site address: http://24.182.240.51/pdf/HiPdProf.pdf
    pdf2html@adobe.com: converts to HTML by email
    pdf2txt@adobe.com: converts to .txt by email