I am not aware of a CPAN module that offers a method along the lines of extract_table(page => 42, row => 1, column => 3). Creating one wouldn't be easy: PDF operators are more like plotter commands drawing on a sheet of paper, so there is no markup like HTML's <TABLE> that defines a table as an embedded object.

Are your PDF files generated automatically, that is to say, in a repeatable fashion? I once managed to extract table-based information from a series of automatically generated PDF files by converting them to PostScript with pdftops (not pdf2ps) and applying some heuristics. Quite a game of chance... but maybe it works for you too?

A similar approach: CAM::PDF comes with a tool, rewritepdf.pl, which can decompress the internal object streams (-d switch). Analysing the decompressed PDF file might give some hints. A typical table ENTRY might be embedded like this:

40 0 Td          <-- x, y offset   (Td: move text position, relative to the start of the current line)
(ENTRY)Tj        <-- ENTRY         (Tj: show text)

The Wikipedia entry for PDF links to "Portable Document Format: An Introduction for Programmers", which gives a lightweight introduction and a table of common PDF operators.
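
To give an idea of what such a heuristic could look like, here is a minimal sketch that scans a decompressed content stream for Td/Tj pairs and groups the strings into rows by their y position. It is written in Python purely for illustration; the regex carries over to Perl almost unchanged. Everything named here (extract_text_positions, group_rows, SAMPLE) is made up for the example, and the sketch assumes the generator only uses BT, Td and Tj. Real files often also use Tm, TD, T*, TJ or hex strings, which this ignores.

```python
import re

# One alternative per token we care about. String literals are matched as a
# whole, so digits or "BT" inside "(...)" are never mistaken for operators.
TOKEN_RE = re.compile(r"""
    (?P<bt>\bBT\b)
  | (?<![\w.\-])(?P<tx>-?\d*\.?\d+)\s+(?P<ty>-?\d*\.?\d+)\s+Td\b
  | \((?P<text>(?:\\.|[^()\\])*)\)\s*Tj\b
""", re.VERBOSE)


def extract_text_positions(stream):
    """Return (x, y, text) for every Tj, accumulating the Td offsets."""
    x = y = 0.0
    entries = []
    for m in TOKEN_RE.finditer(stream):
        if m.group("bt"):
            # A new text object starts again at the origin.
            x = y = 0.0
        elif m.group("text") is not None:
            entries.append((x, y, m.group("text")))
        else:
            # Td operands are offsets, not absolute coordinates.
            x += float(m.group("tx"))
            y += float(m.group("ty"))
    return entries


def group_rows(entries):
    """Group entries sharing a y value into rows, cells ordered by x."""
    rows = {}
    for x, y, text in entries:
        rows.setdefault(y, []).append((x, text))
    # PDF y coordinates grow upwards, so the largest y is the top row.
    return [[text for _, text in sorted(cells)]
            for _, cells in sorted(rows.items(), reverse=True)]


# Hand-written sample stream, not taken from a real PDF file.
SAMPLE = """BT
/F1 10 Tf
50 700 Td
(Name)Tj
100 0 Td
(Qty)Tj
-100 -14 Td
(Widget)Tj
100 0 Td
(3)Tj
ET
"""

if __name__ == "__main__":
    for row in group_rows(extract_text_positions(SAMPLE)):
        print(row)
```

Grouping on exact y values only works when the generator emits identical coordinates for every cell in a row; with less regular input you would probably need to round or bucket the positions, which is where the game of chance begins.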

Update: argl, it's rewritepdf.pl


In reply to Re: Extracting information from a PDF file by Perlbotics
in thread [Updated] Extracting information from a PDF file by Lawliet
