Hi Monks

I know this topic has come up many times before in various flavours. I am sorry to bring it up again, but I need some specific help...

I am writing a script that needs to extract the text from a number of PDFs that contain tabular data. I have tried several approaches (CAM::PDF, pdftotext, pdftohtml, etc.), but for various reasons none of them has proved suitable. Based on those attempts, I have concluded that I need to write custom code that works with the PDF at the stream level.

After spending a lot of time searching forums, reading code, and experimenting with my own code, I have been able to use PDF::API2 to access and decompress the content stream in a PDF. The next step is to parse the stream and extract the text. I understand the algorithm this requires, but (as so often happens) I've hit a problem with the implementation...

The actual text content (the characters) is encoded using fonts that are embedded in the PDF. I can't figure out how to use PDF::API2 to read the embedded font definitions so that I can use them to decode the text. Could you please provide, or direct me to, an example that would help?
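To make the decoding step concrete, here is a minimal sketch of what it could look like once a font's /ToUnicode CMap stream has been decompressed into a string. It deliberately does not use PDF::API2, since reading the font objects through that module is exactly the open question. The sample CMap, the helper names `parse_tounicode` and `decode_hex`, and the two-byte code width are all assumptions made for illustration; a real font may have no ToUnicode entry at all, or may use the array form of bfrange, which this sketch ignores.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample of a decompressed /ToUnicode CMap stream.
my $cmap = <<'END';
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
1 beginbfrange
<0044> <0046> <0061>
endbfrange
endcmap
END

# Build a (character code => Unicode string) map from the CMap text.
sub parse_tounicode {
    my ($text) = @_;
    my %map;

    # bfchar sections hold <src> <dst> pairs; dst is UTF-16BE in hex.
    while ($text =~ /beginbfchar(.*?)endbfchar/sg) {
        my $body = $1;
        while ($body =~ /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g) {
            $map{ hex $1 } = decode('UTF-16BE', pack('H*', $2));
        }
    }

    # bfrange sections hold <lo> <hi> <dst> triples.
    # Assumption: dst is a single BMP code point; the array form is skipped.
    while ($text =~ /beginbfrange(.*?)endbfrange/sg) {
        my $body = $1;
        while ($body =~ /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g) {
            my ($lo, $hi, $dst) = (hex $1, hex $2, hex $3);
            $map{ $lo + $_ } = chr($dst + $_) for 0 .. $hi - $lo;
        }
    }
    return \%map;
}

# Decode the hex digits of a string operand (e.g. from a Tj operator),
# splitting it into fixed-width codes and looking each one up.
sub decode_hex {
    my ($hex, $map, $bytes_per_code) = @_;
    my $digits = 2 * $bytes_per_code;
    return join '',
        map { $map->{ hex $_ } // "\x{FFFD}" }    # U+FFFD for unmapped codes
        $hex =~ /([0-9A-Fa-f]{$digits})/g;
}

my $map = parse_tounicode($cmap);
print decode_hex('002400440045', $map, 2), "\n";    # prints "Aab"
```

The missing piece is still how to get from a PDF::API2 page object to the font dictionary and its ToUnicode stream, so that `$cmap` comes from the file rather than from a here-document.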

Thanks in advance!

In reply to Extracting text from a PDF (using PDF::API2) by music_man1352000
