So, I'd like to get more than text out when I parse a PDF. Ideally, I'd get a DOM with marked-up text, plus images of whatever couldn't be read as text. I've looked at quite a few PDF parsing modules, but I'm having issues with what they output.
Here's exactly what I need: the US government publishes proposed laws, and it nicely puts them out in both PDF and text. However, this isn't really good enough for me, because their text files have pretty much the same content that the PDF parsers produce.
For example, this: http://edocket.access.gpo.gov/2010/2010-26506.htm
is the text version of this: http://edocket.access.gpo.gov/2010/pdf/2010-26506.pdf
So, where are the issues?
1. At the bottom of page 26 of the PDF, there's a math equation that doesn't appear in their text. When I parse it, I get a bunch of useless 'stuff'. I'd like either MathML or an image (I don't care which).
2. I can't figure out how to parse tables in a nice way. Any ideas?
Finally, I'm not one who wants to program just to be programming. If anyone knows of someone who has done this or something similar and is open-source friendly, I'd love to know about it (I don't think this is the case, but I figured I'd ask).