Read PDF files & do regex through Perl.

an_ordinary_man has asked for the wisdom of the Perl Monks concerning the following question:

Hi All.
I want to extract information given in annotations (comments) inside PDF files, using regex.
When I open the PDF file in EditPlus, it shows me a large file of about 10,000 lines containing the data I need, plus lots of text that looks like junk. But when I read it through Perl, it reads only about 29 lines, and most of the junk from the PDF is missing.
I tried the texttopdf modules available, but they do not give me the contents of the annotations.

    open (PDFTYPELOG, "+>pdftype.txt") or die "cannot open file pdftype.txt for writing : $!";
    open (FILE_HANDLE, "D:\\MyScripts\\Social008-017.pdf") or die "cannot open Social008-017.pdf for reading : $!";
    while ($LineContent = <FILE_HANDLE>)
    {
        print PDFTYPELOG ($LineContent);
    }

Please tell me how I can read the whole file line by line (if possible).
Regards.
An_Ordinary_Man

Replies are listed 'Best First'.
Re: Read PDF files & do regex through Perl. by rjray (Chaplain) on Feb 11, 2002 at 23:47 UTC
A PDF file is not a plain text file. It is a fairly complex binary format, so reading it with normal line-oriented I/O will not work. Look into the PDF-oriented modules on CPAN (http://search.cpan.org/search?mode=module&query=PDF), or look for PDF tools on Freshmeat.net that you could use to pre-process the PDF, extracting the parts you want, which may then be handled by the Perl script. --rjray
Re: Read PDF files & do regex through Perl. by beebware (Pilgrim) on Feb 11, 2002 at 23:48 UTC
You may find the official Adobe PDF Reference books handy. They are available for free download (weighing in at 9 MB) from Adobe's website. Sanface have an early development version of a PDF-lib PDF comment extractor which may help guide you in the right direction. You may also find it useful to see how the code in the programs/scripts referenced from this article works, and see if you can 'tweak' it to extract the data instead of placing it in. I personally have had experience of a commercial ($40k) package which converted the PDF to raw XML data, but IIRC even that didn't cope with the annotations in the file. I think the aforementioned PDF specification is your best bet.
Re: Read PDF files & do regex through Perl. by YuckFoo (Abbot) on Feb 12, 2002 at 02:53 UTC
When reading binary files on Windows, you need to set 'binmode' immediately after opening the file. You are reading (and processing) a false 'end-of-file' after 29 lines. For more information, see 'perldoc -f binmode'. Then you can figure out how to process the binary data. YuckFoo
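A minimal sketch of the binmode advice above, assuming the PDF is small enough to read into memory in one go. The helper names slurp_binary and annotation_contents are made up for illustration. The /Contents regex is only an assumption about how simple, uncompressed annotations may appear in the file; annotations stored in compressed streams or as UTF-16 strings will not be matched this way.

```perl
use strict;
use warnings;

# Read a whole file as raw bytes. binmode keeps Windows from treating
# a Ctrl-Z (0x1A) byte as end-of-file and from translating CRLF.
sub slurp_binary {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "cannot open $path for reading: $!";
    binmode($fh);
    local $/;               # undef record separator: read everything at once
    my $data = <$fh>;
    close($fh);
    return $data;
}

# Hypothetical helper: collect literal strings that follow /Contents.
# Handles backslash escapes, but not nested unescaped parentheses.
sub annotation_contents {
    my ($data) = @_;
    my @found;
    while ($data =~ m{/Contents\s*\(((?:\\.|[^\\()])*)\)}gs) {
        push @found, $1;
    }
    return @found;
}

# Path comes from the command line; nothing happens without one.
my $path = shift @ARGV;
if (defined $path) {
    my $pdf = slurp_binary($path);
    printf "read %d bytes\n", length $pdf;
    print "$_\n" for annotation_contents($pdf);
}
```

Run it as `perl script.pl Social008-017.pdf`. If the byte count matches the file size on disk, the false end-of-file is gone, even if the regex finds nothing because the annotations are compressed.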
Re: Read PDF files & do regex through Perl. by aj (Initiate) on Feb 12, 2002 at 10:24 UTC
Here is a place to get PDF files converted to HTML and/or text. Just attach the required files and email them to all 3 places. You will get back a .txt file and an .html file. Please note: "locked" files cannot be opened unless you use a "special" program. Write me at moic@mail.com and I can give you the details. Site address: http://24.182.240.51/pdf/HiPdProf.pdf ; pdf2html@adobe.com converts to HTML by email; pdf2txt@adobe.com converts to .txt by email.