ag4ve has asked for the wisdom of the Perl Monks concerning the following question:

so, i'd like to get more than plain text out when i parse a pdf. ideally, i'd get a dom with marked-up text, plus images of whatever couldn't be read as text. i've looked at quite a few pdf parsing modules, but i'm having issues with what they output.

here's exactly what i need: the us government puts out proposed laws, and they nicely publish them in both pdf and text. however, the text files aren't really good enough for me, because they have pretty much the same content that the pdf parsers give me.
for example: http://edocket.access.gpo.gov/2010/2010-26506.htm
is the text of this: http://edocket.access.gpo.gov/2010/pdf/2010-26506.pdf

so, where are the issues?
1. at the bottom of page 26 of the pdf, there's a math equation that doesn't appear in their text version. when i parse it, i get a bunch of useless 'stuff'. i'd like either mathml or an image (don't care which).
2. i can't figure out how to parse tables in a nice way. any ideas?

finally, i'm not one who wants to program just to be programming. if someone knows of a person or project that has done this (or something similar) and is open source friendly, i'd love to hear about it (i don't think that's the case, but i figured i'd mention it).

Replies are listed 'Best First'.
Re: parse pdf
by LanX (Saint) on Nov 06, 2010 at 00:32 UTC

      Another option is to use a converter to extract the text from the PDF.

      Poppler PDF rendering library

      On Ubuntu, the program you want is pdftotext in the package poppler-utils, installed with:

      sudo apt-get install poppler-utils

      pdftotext has several options which affect the formatting of the text output, so you should experiment with its options to see if you can improve on the text version you already have.
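
      For example, a minimal sketch of driving it from Perl (the file name is just a placeholder; -layout tries to preserve the page's physical layout, which can also help with tables):

          #!/usr/bin/perl
          use strict;
          use warnings;

          # Run pdftotext with -layout so columns and tables keep their
          # physical arrangement; '-' sends the text to stdout.
          my $pdf = '2010-26506.pdf';
          open my $fh, '-|', 'pdftotext', '-layout', $pdf, '-'
              or die "cannot run pdftotext: $!";
          while (my $line = <$fh>) {
              print $line;    # or feed each line to your own parser
          }
          close $fh or warn "pdftotext exited with status $?";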

      I recently used pdftotext to successfully extract the text from a PDF with several hundred pages. YMMV

      It may be worth looking to see if there are other programs capable of extracting text.

        i'm hoping to go directly from pulling the pdfs off the web to a sql file (which is why i really wanted something that would build a dom from the structure of the file), so standalone pdf utilities aren't especially useful to me here.

        however, i seem to have missed CAM::PDF when searching for pdf modules. it might be able to get me images of the stuff i can't format (i haven't looked, but i'm assuming the math is in a different typeface - that, or i just weed it out with a regex), and then i should be able to take the object or line and have it output an image. (even though i said i didn't care, mathml would've been nice ;) )
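
        something like this minimal CAM::PDF sketch is probably where i'd start for the text side (the file name is just a placeholder):

            use strict;
            use warnings;
            use CAM::PDF;

            my $doc = CAM::PDF->new('2010-26506.pdf')
                or die "cannot parse pdf: $CAM::PDF::errstr";

            # getPageText() decodes each page's content stream into
            # plain text; pages where it comes back empty would be the
            # candidates for exporting as images instead.
            for my $p (1 .. $doc->numPages()) {
                my $text = $doc->getPageText($p);
                print "--- page $p ---\n", $text, "\n";
            }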

Re: parse pdf
by patcat88 (Deacon) on Nov 07, 2010 at 05:07 UTC
    PDFs internally are similar to an XML tree; Adobe has the MARS project to create a zero-loss PDF to an XML-ish, semi-open format and back again. But a PDF can never be represented by a pure tree, because indirect objects let one node reference another, creating circular paths. I've found this Acrobat addon very good at fully showing the PDF COS tree and allowing manual editing of it, http://www.windjack.com/product/pdfcanopener/, but it's not a FOSS tool. From a quick look on CPAN, there are many libraries that will give you access to a PDF's COS tree.

    Not all PDFs can be parsed automatically by software, though. A PDF can be just an 8x11 scanned jpeg per page. I once opened a PDF whose text looked like perfect vector graphics (zoom to 1600%) but was unhighlightable. In a PDF editor, EVERY character was made of dozens of vector graphics primitives: the file had come from Adobe Illustrator, and somewhere during the conversion all the fonts turned into vector shapes and stopped being text. Try extracting text when the letter 'a' is 10 rectangles and Bezier curves, each an independent, individually editable shape. Your only choice there might be to try OCRing it, since there is no text in the COS tree.
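
    As a rough sketch of peeking at one page's COS dictionary with CAM::PDF (the file name is a placeholder, and I'm assuming the usual CAM::PDF behavior where dictionary entries come back as CAM::PDF::Node hashrefs with 'type' and 'value' fields):

        use strict;
        use warnings;
        use CAM::PDF;

        my $doc  = CAM::PDF->new('some.pdf') or die $CAM::PDF::errstr;
        my $page = $doc->getPage(1);    # the first page's dictionary

        # Entries of type 'reference' are the indirect objects; the
        # /Parent entry, for example, points back up the page tree,
        # which is why the structure isn't a pure tree.
        for my $key (sort keys %{$page}) {
            printf "/%-10s %s\n", $key, $page->{$key}->{type};
        }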

    Since this is the government, think about "accessibility" support; researching those routes should get you something that is supposed to be screen-reader friendly, which generally means computer-parsable. Your text files without the formulas might already meet ADA screen-reader compatibility (I don't know), in which case you won't get anything better than that. The Federal Register is public domain, so you can just copy the formula out of the PDF, as a bitmap or as vector graphics, into the destination without the computer ever understanding it.

    From a quick look at that PDF, the formulas are plain text only where everything sits on the same line with the same font and font attributes. Sub/superscripts are done by making separate text boxes with absolute positioning, and fraction lines are path shapes. The formulas are fundamentally unparsable: they're just a bunch of absolutely positioned text boxes. OCR is your only hope, but I don't think it will work for engineering formulas.
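
    If you want to see this for yourself, here's a quick sketch that dumps a page's raw content stream with CAM::PDF (page 26 is the page with the equation from the original post; look for the Td/Tm positioning, Tf font-switching, and Tj show-text operators):

        use strict;
        use warnings;
        use CAM::PDF;

        my $doc = CAM::PDF->new('2010-26506.pdf') or die $CAM::PDF::errstr;

        # Print the content stream for the page with the equation; each
        # sub/superscript shows up as its own positioned text operation.
        print $doc->getPageContent(26);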

      yeah, i was hoping the government would have made these documents 'nicer'. that said, i'm just going for a section-by-section search, so i won't need to search the math. i'll just parse until i see a new section and put it into a db with something like:

      doc uid, pdf page #, incremented num, section name, text
      and have another table with:
      doc uid, pdf data

      and just pull up the pdf at that page number when needed... this isn't making me much $$, so i haven't gotten back to it :(
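
      for the record, a minimal DBI/SQLite sketch of that schema (the table and column names are just made up from the description above):

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect('dbi:SQLite:dbname=register.db', '', '',
                                 { RaiseError => 1 });

          # one row per parsed section
          $dbh->do(q{
              CREATE TABLE IF NOT EXISTS sections (
                  doc_uid      TEXT    NOT NULL,
                  pdf_page     INTEGER NOT NULL,
                  seq          INTEGER NOT NULL,  -- the incremented num
                  section_name TEXT,
                  body         TEXT
              )
          });

          # one row per source document, with the pdf stored as a blob
          $dbh->do(q{
              CREATE TABLE IF NOT EXISTS docs (
                  doc_uid  TEXT PRIMARY KEY,
                  pdf_data BLOB
              )
          });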