PDF to Text

chrism01 has asked for the wisdom of the Perl Monks concerning the following question:

Re: PDF to Text by greenFox (Vicar) on Jan 24, 2005 at 08:37 UTC
I don't know the answer to your question, but a Super Search for "convert pdf to text" reveals quite a few nodes on this topic, including (as a quick sample): Can I convert a pdf to html with PDF::Extract??, pdf2txt?, Extract text from PDF and Reading PDF files. A quick skim through those nodes suggests the following modules might help: PDF::Extract and PDF::API2. Searching for pdf on CPAN reveals a few more potential candidates. Good luck, and do let us know how you get on :)
-- Do not seek to follow in the footsteps of the wise. Seek what they sought. -Basho
Re^2: PDF to Text by chrism01 (Friar) on Jan 27, 2005 at 01:31 UTC
I have actually had a look at those modules, but all they do is create or manipulate PDFs. For example, PDF::API2 has a method, `$string = $pdf->stringify`, but this just dumps the file into a string that is still in PDF format, i.e. you get a load of binary rubbish. As for PDF::Extract ("Extracting sub PDF documents from a multi page PDF document"), the output is again PDF. I just need the bare ASCII text that pdftotext gives, except that pdftotext has the odd random glitch which corrupts the layout of the output. If I can't predict the layout, I can't parse it.
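For readers who want to drive pdftotext from Perl rather than look for a pure-Perl module, here is a minimal sketch. It assumes the pdftotext binary (from xpdf or poppler) is on the PATH; the sub names `pdftotext_command` and `pdf_pages` are made up for illustration and are not part of any module mentioned in this thread. The `-layout` option asks pdftotext to preserve the physical page layout, which may or may not help with the column glitches described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list for pdftotext (hypothetical helper).
# '-layout' keeps the physical layout of the page; the final '-'
# sends the extracted text to standard output instead of a file.
sub pdftotext_command {
    my ($pdf_path) = @_;
    return ('pdftotext', '-layout', $pdf_path, '-');
}

# Run pdftotext and return one string per page (hypothetical helper).
# pdftotext ends each page with a form feed character ("\f").
sub pdf_pages {
    my ($pdf_path) = @_;
    # List-form pipe open: no shell is involved, so odd file names are safe.
    open(my $fh, '-|', pdftotext_command($pdf_path))
        or die "Cannot run pdftotext: $!";
    my $text = do { local $/; <$fh> };    # slurp all output
    close($fh)
        or die "pdftotext failed on $pdf_path (exit status $?)";
    return split /\f/, $text;
}

# Only touch a real file when one is given on the command line.
if (@ARGV) {
    my @pages = pdf_pages($ARGV[0]);
    printf "%d page(s) extracted\n", scalar @pages;
}
```

Run it as `perl script.pl some.pdf` (the script name is arbitrary). Having the text split per page makes it easier to handle data that wraps from one column or page to the next.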
Re: PDF to Text by jbrugger (Parson) on Jan 24, 2005 at 08:31 UTC
You could use pdftoall (not Perl) or pdf2txt (also not Perl, but shareware). For another good free utility, try pstotext (on Debian: apt-get install pstotext).
Re^2: PDF to Text by rob_au (Abbot) on Jan 24, 2005 at 10:29 UTC
I would note that the ps2ascii application, which is installed as part of the pstotext package, also includes PDF to ASCII conversion (although the manual page notes that this application does not consider font encoding and cannot handle kerning particularly well). If there is a requirement for a Perl implementation of this solution, the source package for this application may be enlightening.
`perl -le "print unpack'N', pack'B32', '00000000000000000000001000000000'"`
Re: PDF to Text by aquarium (Curate) on Jan 24, 2005 at 09:44 UTC
What is "sufficiently inaccurate"? I used pdftotext and it worked just fine in getting ASCII text from many PDFs.
the hardest line to type correctly is: stty erase ^H
Re^2: PDF to Text by chrism01 (Friar) on Jan 27, 2005 at 01:47 UTC
It works on most lines, but occasionally gets confused and outputs data in a different layout from the original. I really need it to be accurate because even the original is fiddly to deal with. It's basically 2 sets of columns of variable-sized blocks of data. The data wraps from the bottom of the left-hand column to the top of the right-hand column on each page, then to the top of the left-hand column on the next page, and so on. For example (short): `name1 1,2,3,4 8,9,10 name3 1,2,3,4,5 name 1,2,34, name4 1,2,3,4 5,6,7,` but what I sometimes get is: `name1 1,2,3,4 8,9,10 name3 1,2,3,4,5 1,2,3,4 name4 name 1,2,34, 5,6,7` There's a lot more of this. Also, the separations between names, numbers, and the left and right columns are variable.
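One way to cope with a variable gap between the two columns is to detect it per page instead of hard-coding an offset. The sketch below is only an illustration under stated assumptions: the page text pads with spaces (no tabs) and keeps the two columns side by side, as layout-preserving output normally does, and the gutter is the widest run of character positions that is blank on every line. The sub names and the sample data are hypothetical. It does not repair lines that the extractor has already scrambled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the offset where the right-hand column starts, taken as the end
# of the widest interior run of positions that are blank on every line.
# Returns undef when no such run exists (hypothetical helper).
sub find_right_column_start {
    my @lines = @_;
    my @used;    # $used[$i] is true if any line has a non-space at offset $i
    for my $line (@lines) {
        while ($line =~ /\S/g) {
            $used[ pos($line) - 1 ] = 1;
        }
    }
    my ($best_start, $best_len, $run_start, $seen_ink) = (undef, 0, undef, 0);
    for my $i (0 .. $#used) {
        if ($used[$i]) {
            if (defined $run_start) {
                my $len = $i - $run_start;
                ($best_start, $best_len) = ($run_start, $len)
                    if $len > $best_len;
                undef $run_start;
            }
            $seen_ink = 1;
        }
        elsif ($seen_ink && !defined $run_start) {
            $run_start = $i;    # a blank run begins after some text
        }
    }
    return defined $best_start ? $best_start + $best_len : undef;
}

# Return the lines of one page in reading order: the whole left column,
# then the whole right column (hypothetical helper).
sub reading_order {
    my @lines = @_;
    my $split = find_right_column_start(@lines);
    return @lines unless defined $split;
    my (@left, @right);
    for my $line (@lines) {
        my $l = substr($line, 0, $split);
        my $r = length($line) > $split ? substr($line, $split) : '';
        s/\s+$// for $l, $r;    # trim trailing padding, keep indentation
        push @left,  $l if length $l;
        push @right, $r if length $r;
    }
    return (@left, @right);
}

# Made-up sample page: left column padded to 19 characters.
my @page = map { sprintf '%-19s%s', @$_ } (
    [ 'name1 1,2,3,4', 'name3 1,2,3,4,5' ],
    [ '      8,9,10',  'name4 1,2,3,4' ],
    [ 'name2 1,2,34',  '      5,6,7' ],
);
print "$_\n" for reading_order(@page);
```

Leading indentation is kept on purpose, so a continuation line such as `      8,9,10` can still be told apart from a line that starts a new name. Pages where some text spans both columns would defeat the gutter detection and need separate handling.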