in reply to parse pdf

See "Parsing PDFs by text position?" and the links included in that thread for a start. HTH!

Cheers Rolf

Re^2: parse pdf
by tod222 (Pilgrim) on Nov 06, 2010 at 01:30 UTC

    Another option is to use a converter to extract the text from the PDF, for example the tools that come with the Poppler PDF rendering library.

    On Ubuntu, the program you want is pdftotext in the package poppler-utils, installed with:

    sudo apt-get install poppler-utils

    pdftotext has several options which affect the formatting of the text output, so you should experiment with its options to see if you can improve on the text version you already have.
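
    To make that experimenting easier to repeat, here is a minimal sketch in Python that runs pdftotext with different options and returns the text. The function names (pdftotext_command, extract_text) and the "mode" parameter are made up for illustration; the only pdftotext options used are -layout, -raw, -f and -l, plus "-" as the output file so the text goes to stdout. Treat it as a starting point under those assumptions, not a finished tool.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, mode="layout", first_page=None, last_page=None):
    """Build the argument list for a pdftotext run (hypothetical helper).

    mode is an illustrative parameter, not a pdftotext concept:
      "layout"  -> pass -layout (try to keep the physical page layout)
      "raw"     -> pass -raw (keep text in content-stream order)
      "default" -> pass neither option
    """
    cmd = ["pdftotext"]
    if mode == "layout":
        cmd.append("-layout")
    elif mode == "raw":
        cmd.append("-raw")
    elif mode != "default":
        raise ValueError("unknown mode: %r" % (mode,))
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    # "-" as the output file name makes pdftotext write to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, **options):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils first")
    result = subprocess.run(
        pdftotext_command(pdf_path, **options),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Assumes UTF-8 output; undecodable bytes are replaced rather than fatal.
    return result.stdout.decode("utf-8", errors="replace")
```

    Calling extract_text("paper.pdf", mode="layout") and extract_text("paper.pdf", mode="raw") on the same file ("paper.pdf" is a placeholder name) and comparing the two results is a quick way to see which option suits a given document.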

    I recently used pdftotext to successfully extract the text from a PDF with several hundred pages. YMMV.

    It may be worth looking to see if there are other programs capable of extracting text.

      I'm hoping to go directly from fetching the PDFs off the web to a SQL file (which is why I really wanted something that could build a DOM from the structure of the file), so ready-made PDF utilities aren't especially useful to me here.

      However, I seem to have missed CAM::PDF when searching for PDF modules. It might be able to get me images of the stuff I can't format: I haven't looked, but I'm assuming the math is in a different typeface (that, or I just weed it out with a regex), so I should be able to get the object or line and have it output as an image. (Even though I said I didn't care, MathML would've been nice ;) )