Re: parse pdf

PDFs internally are similar to an XML tree. Adobe has the Mars project to create a zero-loss, XML-ish, semi-open-ish format that converts from PDF and back again. But a PDF can never be represented as a strict tree, because any node can hold a reference to another node ("indirect objects"), and those references create circular paths. I've found this Acrobat add-on very good at showing the full PDF COS tree and allowing manual editing of it, http://www.windjack.com/product/pdfcanopener/, but it's not a FOSS tool. From a quick look on CPAN, there are many libraries that will give you access to a PDF's COS tree.

Not all PDFs can be parsed automatically by software. A PDF can be nothing more than one scanned 8.5x11 JPEG per page. Or a PDF's text might look like perfect vector graphics (zoom to 1600%) and still be unhighlightable. I ran into one like that and opened it in a PDF editor: EVERY character was made of dozens of vector graphics primitives. The file was made in Adobe Illustrator, and somewhere during the conversion all the fonts were turned into vector graphics and were no longer text. Try extracting text when the letter 'a' is 10 rectangles and Bezier curves, all independent, individually editable shapes. Your only choice might be to OCR it, since there is no text in the COS tree.
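To make the point about indirect objects concrete, here is a minimal Python sketch. It does not read a real PDF: `Ref`, `OBJECTS`, `refs_in` and `find_cycle` are made-up names, and the three dictionaries are a toy stand-in for a catalog, a page-tree node and a page. The only PDF fact it relies on is that a page points back to its parent through /Parent while the parent lists the page in /Kids, so walking the references never bottoms out the way a tree walk would.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference such as '2 0 R'."""
    num: int


# Toy object table: object number -> dictionary (not parsed from a real file).
OBJECTS = {
    1: {"Type": "Catalog", "Pages": Ref(2)},
    2: {"Type": "Pages", "Kids": [Ref(3)], "Count": 1},
    3: {"Type": "Page", "Parent": Ref(2)},
}


def refs_in(value):
    """Yield every Ref found inside a possibly nested dict/list value."""
    if isinstance(value, Ref):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from refs_in(item)
    elif isinstance(value, list):
        for item in value:
            yield from refs_in(item)


def find_cycle(objects, start):
    """Depth-first walk; return the first reference path that revisits an object."""
    def walk(num, path):
        if num in path:
            # Already on the current path: report the loop itself.
            return path[path.index(num):] + [num]
        for ref in refs_in(objects[num]):
            found = walk(ref.num, path + [num])
            if found:
                return found
        return None

    return walk(start, [])


print(find_cycle(OBJECTS, 1))  # the loop between the Pages node and its Page
```

Any tool that dumps the COS "tree" has to track visited objects like this, otherwise it recurses forever.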

Since this is the government, think about "accessibility" support. Researching those routes should turn up something that is supposed to be screen reader friendly, which pretty much always means computer parsable. Your text files without the formulas might already meet ADA screen reader compatibility (I don't know), in which case you won't get anything better than that. The Federal Register is public domain, so you can just copy each formula out of the PDF as a bitmap or as vector graphics into the destination without the computer ever understanding it.

From a quick look at that PDF, the formulas are ordinary text wherever the characters sit on the same line in the same font with the same font attributes. Sub/superscripts are done by making another text box with absolute positioning, and fraction lines are path shapes. So the formulas are fundamentally unparsable: they are just a bunch of absolutely positioned text boxes. OCR is your only hope, but I don't think it will work for engineering formulas.
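To show what "a bunch of absolutely positioned text boxes" means for a parser, here is a rough Python sketch. Everything in it is an assumption for illustration: `Fragment` and `linearize` are invented names, the coordinates are made up, and it presumes some library has already handed you each text box with its position and font size. It only guesses at sub/superscripts from a smaller font and a shifted baseline, and it knows nothing about fraction lines or stacked limits, which is exactly where this approach falls apart.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """One positioned text box (hypothetical structure, made-up values)."""
    x: float     # left edge
    y: float     # baseline; in PDF coordinates larger y is higher on the page
    size: float  # font size
    text: str


def linearize(fragments, tolerance=0.5):
    """Guess a one-line reading of the fragments, left to right.

    A fragment in a smaller font whose baseline is above or below the
    baseline of the largest-font fragment is marked as ^{...} or _{...}.
    """
    base = max(fragments, key=lambda f: f.size)
    parts = []
    for frag in sorted(fragments, key=lambda f: f.x):
        offset = frag.y - base.y
        if frag.size < base.size and offset > tolerance:
            parts.append("^{" + frag.text + "}")
        elif frag.size < base.size and offset < -tolerance:
            parts.append("_{" + frag.text + "}")
        else:
            parts.append(frag.text)
    return "".join(parts)


boxes = [
    Fragment(100.0, 500.0, 10.0, "E = mc"),
    Fragment(131.0, 504.0, 7.0, "2"),
]
print(linearize(boxes))
```

Even when a guess like this works for a simple exponent, a fraction is two such runs plus a path shape between them, and nothing in the file says they belong together.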

Replies are listed 'Best First'.
Re^2: parse pdf
by ag4ve (Monk) on Nov 14, 2010 at 09:01 UTC

yeah, i was hoping that the government would have made these documents 'nicer'. that said, i'm just going for a section by section search. so, i won't need to search the math. i'll just go through and parse until i see a new section and put it into a db with something like:

`doc uid, pdf page #, incremented num, section name, text`

and have another table with:

`doc uid, pdf data`

and just call the pdf with the page number when needed... this isn't making me much $$, so i haven't gotten back to it :(
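The two-table layout described above could look something like this in Python with the standard sqlite3 module. This is only a sketch under assumptions: the table and column names (`documents`, `sections`, `seq`, `body`), the helper names, and the sample rows are all made up for illustration, and the section text is placeholder data, not anything from a real document.

```python
import sqlite3


def create_schema(conn):
    """Create the two tables: raw PDFs, and one row per extracted section."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS documents (
            doc_uid  TEXT PRIMARY KEY,
            pdf_data BLOB NOT NULL
        );
        CREATE TABLE IF NOT EXISTS sections (
            doc_uid      TEXT    NOT NULL REFERENCES documents(doc_uid),
            pdf_page     INTEGER NOT NULL,  -- page where the section starts
            seq          INTEGER NOT NULL,  -- incremented number within the doc
            section_name TEXT    NOT NULL,
            body         TEXT    NOT NULL,  -- section text, formulas left out
            PRIMARY KEY (doc_uid, seq)
        );
    """)


def find_sections(conn, term):
    """Return (doc_uid, pdf_page, section_name) for sections containing term."""
    rows = conn.execute(
        "SELECT doc_uid, pdf_page, section_name FROM sections "
        "WHERE body LIKE ? ORDER BY doc_uid, seq",
        ("%" + term + "%",),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
create_schema(conn)
# Placeholder rows; a real loader would insert the PDF bytes and parsed text.
conn.execute("INSERT INTO documents VALUES (?, ?)", ("fr-0001", b"%PDF-"))
conn.executemany(
    "INSERT INTO sections VALUES (?, ?, ?, ?, ?)",
    [
        ("fr-0001", 3, 1, "Section A", "general provisions and definitions"),
        ("fr-0001", 7, 2, "Section B", "emission limits for new sources"),
    ],
)
print(find_sections(conn, "emission"))
```

A hit gives back the doc uid and page number, which is enough to pull `pdf_data` from the other table and open the PDF at that page, so the formulas never have to be parsed at all.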
