Internally a PDF looks a lot like an XML tree, and Adobe has the Mars project to create a zero-loss, XML-ish, semi-open format that converts from PDF and back again. But a PDF can never be represented as a strict tree, because any node can reference another node ("Indirect Objects"), and those references create circular paths. I've found this Acrobat add-on very good at showing the full PDF COS tree and allowing manual editing of it: http://www.windjack.com/product/pdfcanopener/ (but it's not a FOSS tool). From a quick look on CPAN, there are many libraries that will give you access to a PDF's COS tree.

Not all PDFs can be parsed automatically by software. A PDF can be nothing more than one 8x11 scanned JPEG per page. A PDF's text can also look like perfect vector graphics (zoom to 1600%) and still be unhighlightable. I once opened a file like that in a PDF editor, and EVERY character was made of dozens of vector graphics primitives. The file had been made in Adobe Illustrator, and somewhere during the conversion all the fonts were turned into vector graphics and were not text anymore. Try extracting text when the letter 'a' is 10 rectangles and Bezier curves, each an independent, individually editable shape. Your only choice might be to OCR it, since there is no text in the COS tree.
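To make the point about indirect objects concrete, here is a minimal Python sketch of a toy object graph. The object numbers, the `Ref` class, and the `walk`/`reachable` helpers are all made up for illustration; this is not how any particular PDF library exposes the COS tree. The page points back at its parent `Pages` node, so a walk that does not remember what it has visited would loop forever.

```python
class Ref:
    """Stand-in for an indirect reference such as `3 0 R`."""
    def __init__(self, num):
        self.num = num

# Hypothetical, hand-written objects keyed by object number.
objects = {
    1: {"Type": "Catalog", "Pages": Ref(2)},
    2: {"Type": "Pages", "Kids": [Ref(3)], "Count": 1},
    3: {"Type": "Page", "Parent": Ref(2), "Contents": Ref(4)},  # Parent closes the cycle
    4: {"Stream": "BT (Hello) Tj ET"},
}

def walk(value, seen, order):
    """Depth-first walk that follows references but visits each object once."""
    if isinstance(value, Ref):
        if value.num in seen:
            return  # already visited: this is where a naive tree walk would loop
        seen.add(value.num)
        order.append(value.num)
        walk(objects[value.num], seen, order)
    elif isinstance(value, dict):
        for child in value.values():
            walk(child, seen, order)
    elif isinstance(value, list):
        for child in value:
            walk(child, seen, order)

def reachable(root_num):
    """Return object numbers reachable from root_num, in visiting order."""
    order = []
    walk(Ref(root_num), set(), order)
    return order

print(reachable(1))
```

The `seen` set is the whole trick: the structure is a graph, so anything that dumps it as a tree has to either stop at repeated references or print them as links.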

Since this is the government, think about "accessibility" support. Researching those routes should lead you to something that is supposed to be screen-reader friendly, which always means computer parsable. Your text files without the formulas might already meet ADA screen-reader compatibility (I don't know), in which case you won't get anything better than that. The Federal Register is public domain, so you can just copy the formula out of the PDF as a bitmap or as vector graphics into the destination without the computer ever understanding it.

From a quick look at that PDF, the formulas are stored as text wherever the characters sit on the same line with the same font and the same font attributes. Sub/superscripts are done by making another text box with absolute positioning, and fraction lines are path shapes. So the formulas are fundamentally unparsable: they are just a bunch of absolutely positioned text boxes plus some drawn lines. OCR is your only hope, but I don't think it will work for engineering formulas.
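As a rough illustration of what "absolutely positioned text boxes" leaves you to work with, here is a small Python sketch. The `Fragment` records and the `classify` helper are hypothetical, and the coordinates are hand-made, not taken from the actual PDF. It guesses sub/superscripts by comparing each box's baseline to the baseline of the largest box on the line.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    text: str
    x: float     # left edge, in points
    y: float     # baseline, in points (PDF y grows upward)
    size: float  # font size, in points

def classify(fragments, tolerance=0.5):
    """Label each fragment as base, super, or sub, in left-to-right order."""
    # Assumption: the largest font on the line marks the main baseline.
    base = max(fragments, key=lambda f: f.size)
    labelled = []
    for frag in sorted(fragments, key=lambda f: f.x):
        offset = frag.y - base.y
        if abs(offset) <= tolerance:
            role = "base"
        elif offset > 0:
            role = "super"
        else:
            role = "sub"
        labelled.append((frag.text, role))
    return labelled

# Hand-made example: an "x" with a subscript "i" and a superscript "2".
line = [
    Fragment("2", 78.0, 704.0, 7.0),
    Fragment("x", 72.0, 700.0, 10.0),
    Fragment("i", 77.5, 698.0, 7.0),
]
print(classify(line))
```

Even this easy case depends on guessing a tolerance and on the boxes having been grouped into a line already. Fraction bars drawn as paths, stacked numerators and denominators, and nested scripts are not handled at all, which is the real problem with these formulas.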

In reply to Re: parse pdf by patcat88
in thread parse pdf by ag4ve
