Re^2: parse pdf

Another option is to use a converter to extract the text from the PDF.

On Ubuntu, the program you want is pdftotext in the package poppler-utils, installed with:

sudo apt-get install poppler-utils

pdftotext has several options that affect the formatting of the text output (for example, -layout keeps the physical page layout and -raw keeps the text in content-stream order), so experiment with them to see if you can improve on the text version you already have.
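One way to compare the output modes side by side is to run pdftotext once per mode and diff the results. The following is only a minimal sketch, assuming pdftotext from poppler-utils is on your PATH; the helper names `pdftotext_command` and `extract_variants` and the output file naming are made up for illustration, and the only flags used are -layout, -raw, -f and -l.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, mode="default",
                      first_page=None, last_page=None):
    """Build the argument list for one pdftotext run (hypothetical helper)."""
    cmd = ["pdftotext"]
    if mode == "layout":
        cmd.append("-layout")  # keep the physical page layout
    elif mode == "raw":
        cmd.append("-raw")     # keep text in content-stream order
    elif mode != "default":
        raise ValueError("unknown mode: %s" % mode)
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    cmd += [pdf_path, txt_path]
    return cmd


def extract_variants(pdf_path, stem):
    """Run pdftotext once per mode; return the text files written."""
    outputs = []
    for mode in ("default", "layout", "raw"):
        txt_path = "%s-%s.txt" % (stem, mode)
        # check=True raises CalledProcessError if pdftotext exits non-zero
        subprocess.run(pdftotext_command(pdf_path, txt_path, mode), check=True)
        outputs.append(txt_path)
    return outputs
```

Calling `extract_variants("report.pdf", "report")` (with a file name of your own) should leave report-default.txt, report-layout.txt and report-raw.txt to compare. Which one is closest to what you need depends on the document.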

I recently used pdftotext to extract the text from a PDF of several hundred pages, and it worked well. YMMV.

It may be worth looking to see if there are other programs capable of extracting text.


Replies are listed 'Best First'.
Re^3: parse pdf by ag4ve (Monk) on Nov 06, 2010 at 01:51 UTC
I'm hoping to go directly from getting the PDFs off the web to a SQL file (which is why I really wanted something that might build a DOM from the structure of the file), so ready-made PDF utilities aren't especially useful to me here.

However, I seem to have missed CAM::PDF when searching for PDF modules. It might be able to get me images of the stuff that I can't format (I haven't looked, but I'm assuming that the math is in another typeface; that, or I just weed it out with a regex), and then I should be able to get the object or line and have it output an image. (Even though I said I didn't care, MathML would've been nice ;) )
