ag4ve has asked for the wisdom of the Perl Monks concerning the following question:

so, i'd like to get more than plain text out when i parse a pdf. ideally, i'd get a dom with marked-up text, plus images of whatever couldn't be read as text. i've looked at quite a few pdf parsing modules, but i'm having issues with what they output.

here's exactly what i need: the us government puts out proposed laws, and they nicely publish them in both pdf and text. however, the text files aren't really good enough for me, because they have pretty much the same content that the pdf parsers give me.
for example: http://edocket.access.gpo.gov/2010/2010-26506.htm
is the text of this: http://edocket.access.gpo.gov/2010/pdf/2010-26506.pdf

so, where are the issues?
1. at the bottom of page 26 of the pdf, there's a math equation that doesn't appear in their text version. when i parse it, i get a bunch of useless 'stuff'. i'd like either mathml or an image (don't care which).
2. i can't figure out how to parse tables in a nice way. any ideas?

finally, i'm not one who wants to program just to be programming. if someone knows of a person or project that has done this (or something similar) and is open source friendly, i'd love to hear about it (i don't think that's the case, but i figured i'd mention it).

Replies are listed 'Best First'.
Re: parse pdf
by LanX (Saint) on Nov 06, 2010 at 00:32 UTC

      Another option is to use a converter to extract the text from the PDF.

      Poppler PDF rendering library

      On Ubuntu, the program you want is pdftotext in the package poppler-utils, installed with:

      sudo apt-get install poppler-utils

      pdftotext has several options which affect the formatting of the text output, so you should experiment with its options to see if you can improve on the text version you already have.
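
      For example, a minimal sketch of driving it from Perl (the file name is just a placeholder; -layout tries to preserve the page's physical layout, which can also help with tables):

          #!/usr/bin/perl
          use strict;
          use warnings;

          # Run pdftotext with -layout so columns and tables keep their
          # physical arrangement; '-' sends the text to stdout.
          my $pdf = '2010-26506.pdf';
          open my $fh, '-|', 'pdftotext', '-layout', $pdf, '-'
              or die "cannot run pdftotext: $!";
          while (my $line = <$fh>) {
              print $line;    # or feed each line to your own parser
          }
          close $fh or warn "pdftotext exited with status $?";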

      I recently used pdftotext to successfully extract the text from a PDF with several hundred pages. YMMV

      It may be worth looking to see if there are other programs capable of extracting text.

        i'm hoping to go directly from pulling the pdfs off the web to a sql file (which is why i really wanted something that would build a dom from the structure of the file), so standalone pdf utilities aren't especially useful to me here.

        however, i seem to have missed CAM::PDF when searching for pdf modules. it might be able to get me images of the stuff i can't format (i haven't looked, but i'm assuming the math is in a different typeface - that, or i just weed it out with a regex), and then i should be able to take the object or line and have it output an image. (even though i said i didn't care, mathml would've been nice ;) )
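
        something like this minimal CAM::PDF sketch is probably where i'd start for the text side (the file name is just a placeholder):

            use strict;
            use warnings;
            use CAM::PDF;

            my $doc = CAM::PDF->new('2010-26506.pdf')
                or die "cannot parse pdf: $CAM::PDF::errstr";

            # getPageText() decodes each page's content stream into
            # plain text; pages where it comes back empty would be the
            # candidates for exporting as images instead.
            for my $p (1 .. $doc->numPages()) {
                my $text = $doc->getPageText($p);
                print "--- page $p ---\n", $text, "\n";
            }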

Re: parse pdf
by patcat88 (Deacon) on Nov 07, 2010 at 05:07 UTC
    PDFs internally are similar to an XML tree; Adobe has the MARS project to create a zero-loss PDF to an XML-ish, semi-open format and back again. But a PDF can never be represented by a pure tree, because indirect objects let one node reference another, creating circular paths. I've found this Acrobat addon very good at fully showing the PDF COS tree and allowing manual editing of it, http://www.windjack.com/product/pdfcanopener/, but it's not a FOSS tool. From a quick look on CPAN, there are many libraries that will give you access to a PDF's COS tree.

    Not all PDFs can be parsed automatically by software, though. A PDF can be just an 8x11 scanned jpeg per page. I once opened a PDF whose text looked like perfect vector graphics (zoom to 1600%) but was unhighlightable. In a PDF editor, EVERY character was made of dozens of vector graphics primitives: the file had come from Adobe Illustrator, and somewhere during the conversion all the fonts turned into vector shapes and stopped being text. Try extracting text when the letter 'a' is 10 rectangles and Bezier curves, each an independent, individually editable shape. Your only choice there might be to try OCRing it, since there is no text in the COS tree.
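
    As a rough sketch of peeking at one page's COS dictionary with CAM::PDF (the file name is a placeholder, and I'm assuming the usual CAM::PDF behavior where dictionary entries come back as CAM::PDF::Node hashrefs with 'type' and 'value' fields):

        use strict;
        use warnings;
        use CAM::PDF;

        my $doc  = CAM::PDF->new('some.pdf') or die $CAM::PDF::errstr;
        my $page = $doc->getPage(1);    # the first page's dictionary

        # Entries of type 'reference' are the indirect objects; the
        # /Parent entry, for example, points back up the page tree,
        # which is why the structure isn't a pure tree.
        for my $key (sort keys %{$page}) {
            printf "/%-10s %s\n", $key, $page->{$key}->{type};
        }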

    Since this is the government, think about "accessibility" support; researching those routes should get you something that is supposed to be screen-reader friendly, which generally means computer-parsable. Your text files without the formulas might already meet ADA screen-reader compatibility (I don't know), in which case you won't get anything better than that. The Federal Register is public domain, so you can just copy the formula out of the PDF, as a bitmap or as vector graphics, into the destination without the computer ever understanding it.

    From a quick look at that PDF, the formulas are plain text only where everything sits on the same line with the same font and font attributes. Sub/superscripts are done by making separate text boxes with absolute positioning, and fraction lines are path shapes. The formulas are fundamentally unparsable: they're just a bunch of absolutely positioned text boxes. OCR is your only hope, but I don't think it will work for engineering formulas.
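
    If you want to see this for yourself, here's a quick sketch that dumps a page's raw content stream with CAM::PDF (page 26 is the page with the equation from the original post; look for the Td/Tm positioning, Tf font-switching, and Tj show-text operators):

        use strict;
        use warnings;
        use CAM::PDF;

        my $doc = CAM::PDF->new('2010-26506.pdf') or die $CAM::PDF::errstr;

        # Print the content stream for the page with the equation; each
        # sub/superscript shows up as its own positioned text operation.
        print $doc->getPageContent(26);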

      yeah, i was hoping the government would have made these documents 'nicer'. that said, i'm just going for a section-by-section search, so i won't need to search the math. i'll just parse until i see a new section and put it into a db with something like:

      doc uid, pdf page #, incremented num, section name, text
      and have another table with:
      doc uid, pdf data

      and just pull up the pdf at that page number when needed... this isn't making me much $$, so i haven't gotten back to it :(
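
      for the record, a minimal DBI/SQLite sketch of that schema (the table and column names are just made up from the description above):

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect('dbi:SQLite:dbname=register.db', '', '',
                                 { RaiseError => 1 });

          # one row per parsed section
          $dbh->do(q{
              CREATE TABLE IF NOT EXISTS sections (
                  doc_uid      TEXT    NOT NULL,
                  pdf_page     INTEGER NOT NULL,
                  seq          INTEGER NOT NULL,  -- the incremented num
                  section_name TEXT,
                  body         TEXT
              )
          });

          # one row per source document, with the pdf stored as a blob
          $dbh->do(q{
              CREATE TABLE IF NOT EXISTS docs (
                  doc_uid  TEXT PRIMARY KEY,
                  pdf_data BLOB
              )
          });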