I'd like to get more than plain text out when I parse a PDF. Ideally, I'd get a DOM with marked-up text, plus images of whatever couldn't be read as text. I've looked at quite a few PDF parsing modules, but I'm having issues with what they output.

Here's exactly what I need. The US government publishes proposed laws, and it nicely puts them out in both PDF and text. However, that isn't really good enough for me, because their text files contain pretty much the same thing the PDF parsers produce.
For example, http://edocket.access.gpo.gov/2010/2010-26506.htm
is the text of this: http://edocket.access.gpo.gov/2010/pdf/2010-26506.pdf

So, where are the issues?
1. At the bottom of page 26 of the PDF, there's a math equation that doesn't appear in their text version. When I parse it, I get a bunch of useless 'stuff'. I'd like either MathML or an image (I don't care which).
2. I can't figure out how to parse tables in a nice way. Any ideas?

Finally, I'm not one who wants to program just for the sake of programming. If someone knows of an existing project that does this or something similar and is open-source friendly, I'd love to know about it (I don't think one exists, but I figured I'd ask).


In reply to parse pdf by ag4ve
