chrism01 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Guys
Is there a simple module that can dump a pdf file out to text? All the modules I've found seem to be for creating or manipulating pdf, which I am not interested in.
I started with the Unix util pdftotext, but its output is sufficiently inaccurate that I can't really code around the problems.
Thx
Chris

UPDATED (SOLVED):
For anyone else trying to solve a similar problem: in the end I ran pdftops on the .pdf file, then ps2ascii on the resulting .ps file, and this gave me a .txt file I could handle in Perl.
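A minimal sketch of how that two-step conversion might be driven from Perl. It assumes pdftops and ps2ascii (Ghostscript) are on the PATH and that the input name ends in .pdf; the helper names conversion_commands and pdf_to_text are made up for illustration, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the two external commands: pdftops (PDF -> PS), then ps2ascii (PS -> text).
# Hypothetical helper; the output names are derived from the input name.
sub conversion_commands {
    my ($pdf) = @_;
    (my $base = $pdf) =~ s/\.pdf\z//i;
    my ($ps, $txt) = ("$base.ps", "$base.txt");
    return ( [ 'pdftops', $pdf, $ps ], [ 'ps2ascii', $ps, $txt ] );
}

# Run both steps in order and return the name of the resulting text file.
sub pdf_to_text {
    my ($pdf) = @_;
    my @cmds = conversion_commands($pdf);
    for my $cmd (@cmds) {
        # List form of system() avoids the shell, so odd file names are safe.
        system(@$cmd) == 0
            or die "'@$cmd' failed (exit status $?)\n";
    }
    return $cmds[-1][-1];
}

# Only convert when a file name is given on the command line.
if (@ARGV) {
    my $txt = pdf_to_text($ARGV[0]);
    open my $fh, '<', $txt or die "Cannot open $txt: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        print "$line\n";    # replace with the real parsing
    }
    close $fh;
}
```

Run it as, say, perl pdf2txt.pl report.pdf (the script name is arbitrary). The intermediate .ps file is left in place so it can be inspected if the text looks wrong.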

Replies are listed 'Best First'.
Re: PDF to Text
by greenFox (Vicar) on Jan 24, 2005 at 08:37 UTC
      I have actually had a look at those modules, but all they do is create or manipulate PDFs. E.g. PDF::API2 has a method, $string = $pdf->stringify, but this just dumps the file into a string that is still in PDF format, i.e. you get a load of binary rubbish.
      As for PDF::Extract ("Extracting sub PDF documents from a multi page PDF document"): again, the output is PDF.
      I just need the bare ascii text that pdftotext gives, except it has the odd random glitch which makes the output corrupted in terms of layout.
      If I can't predict the layout, I can't parse it.
Re: PDF to Text
by jbrugger (Parson) on Jan 24, 2005 at 08:31 UTC
    You could use pdftoall (not Perl) or pdf2txt (also not Perl, and shareware).
    For another good free util, try pstotext (on Debian: apt-get install pstotext).
      I would note that the ps2ascii application, which is installed as part of the pstotext package, also includes PDF to ASCII conversion (although the manual page notes that it does not consider font encoding and cannot handle kerning particularly well). If there is a requirement for a Perl implementation of this solution, the source package for this application may be enlightening.

       

      perl -le "print unpack'N', pack'B32', '00000000000000000000001000000000'"

Re: PDF to Text
by aquarium (Curate) on Jan 24, 2005 at 09:44 UTC
    What is "sufficiently inaccurate"?
    I used pdftotext and it worked just fine for getting ASCII text out of many PDFs.
    the hardest line to type correctly is: stty erase ^H
      It works on most lines, but occasionally gets confused and outputs data in a different layout from the original.
      I really need it to be accurate because even the original is fiddly to deal with.
      It's basically 2 sets of columns of variable-sized blocks of data, which wrap around from the bottom of the left-hand column to the top of the right-hand column on each page, then to the top of the left-hand column on the next page, etc ...
      eg (short example):
      name1 1,2,3,4 8,9,10 name3 1,2,3,4,5 name 1,2,34, name4 1,2,3,4 5,6,7,
      but what I sometimes get is:
      name1 1,2,3,4 8,9,10 name3 1,2,3,4,5 1,2,3,4 name4 name 1,2,34, 5,6,7
      There's a lot more of this ... also, the separations between names, nums, and left and right cols are variable...