I have had some success with pdftohtml in the past.

It wasn't easy though. The tool has two major modes, HTML and XML. I can't remember exactly what the problem was with the HTML mode, but I ended up not using it at all. I used the XML mode, with a LOT of post-processing (in Perl).

For starters, the XML was not well-formed (i, b, u and a tags were not properly nested), so I had to disentangle them. Then what you get is a bunch of strings with their position on the page. From there I had to order them, merge them to create lines (sub/superscripts needed to be handled, of course), and then create paragraphs... fun!
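To give an idea of the "order them, merge them into lines" step, here is a minimal sketch in Python (my original was Perl, and this is not that code). It assumes each string has already been pulled out of the XML with its top, left and height; the `Fragment` class, the `group_into_lines` name and the `tolerance` value are all made up for illustration. Fragments whose top is close to the first fragment of the current line are treated as the same line, which is a crude way of keeping sub/superscripts with the text they belong to.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One positioned string; the field names are assumptions for this sketch."""
    top: float
    left: float
    height: float
    text: str


def group_into_lines(fragments, tolerance=0.5):
    """Group positioned fragments into lines of text.

    A fragment joins the current line when its top is within
    tolerance * height of the first fragment on that line.
    The threshold is a guess and would need tuning per document.
    """
    lines = []
    # Walk the page top to bottom, then left to right.
    for frag in sorted(fragments, key=lambda f: (f.top, f.left)):
        if lines:
            anchor = lines[-1][0]
            if abs(frag.top - anchor.top) <= tolerance * anchor.height:
                lines[-1].append(frag)
                continue
        lines.append([frag])
    # Within a line, reading order is left to right.
    return [
        " ".join(f.text for f in sorted(line, key=lambda f: f.left))
        for line in lines
    ]


if __name__ == "__main__":
    page = [
        Fragment(top=100, left=60, height=10, text="world"),
        Fragment(top=100, left=10, height=10, text="Hello"),
        Fragment(top=98, left=95, height=6, text="2"),  # superscript, slightly higher
        Fragment(top=120, left=10, height=10, text="Next"),
    ]
    for line in group_into_lines(page):
        print(line)
```

Building paragraphs from the lines needs another pass of the same kind (comparing vertical gaps and indentation), and that is where most of the document-specific tweaking ends up.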

That was with version 0.36, the one that seems to come with most Linux distributions (it was released in 2002). Sourceforge has some more recent ("experimental") versions. I tried 0.40a, which produced wildly different output, at least in XML mode, and gave up. The trouble with version 0.36 is that it cannot cope with some recent PDF files (PDF version 1.6).

Overall it was quite painful, but in the end I managed to extract some information from the files.

Oddly enough, I am currently using pdftotext for another project, and it seems to be doing quite well, even though of course the output is simpler than what pdftohtml produces. I haven't noticed it dropping letters so far.


In reply to Re: Extracting text from PDF. No really by mirod
in thread Extracting text from PDF. No really by clinton
