in reply to How to Extract PDF tables using Perl

The best advice I can give you is to run pdftohtml -xml and parse the coordinates given in the XML output.

See also Parsing PDFs by text position?
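
To make the coordinate idea concrete, here is a minimal sketch. It assumes each piece of text arrives as a `<text>` element with `top`, `left`, `width` and `height` attributes, which is what pdftohtml -xml emits; the sample data and the name `extract_cells` are made up for illustration. The regex is only there to keep the sketch free of non-core modules. For real input you would want a proper XML parser, which would also decode entities such as `&amp;`.

```perl
use strict;
use warnings;

# Sample shaped like `pdftohtml -xml` output; the values are invented.
my $xml = <<'XML';
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0"><b>Name</b></text>
<text top="100" left="200" width="30" height="12" font="0"><b>Qty</b></text>
<text top="120" left="50" width="45" height="12" font="0">Apple</text>
<text top="121" left="200" width="10" height="12" font="0">3</text>
</page>
XML

# Return one hash per <text> element: its attributes plus the text content.
# (extract_cells is a hypothetical helper, not part of any module.)
sub extract_cells {
    my ($xml) = @_;
    my @cells;
    while ( $xml =~ m{<text\s+([^>]*)>(.*?)</text>}gs ) {
        my ( $attrs, $content ) = ( $1, $2 );
        my %cell = $attrs =~ /(\w+)="([^"]*)"/g;    # top, left, width, ...
        $content =~ s/<[^>]+>//g;    # drop inline markup such as <b> or <i>
        $cell{text} = $content;      # entities are left undecoded here
        push @cells, \%cell;
    }
    return @cells;
}

# Print top, left and text for every cell found.
printf "%4d %4d  %s\n", @{$_}{qw(top left text)} for extract_cells($xml);
```

Note that "Apple" and "3" sit on the same visual line but have slightly different `top` values, which is why the next step needs a tolerance rather than exact matching.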

The hard work - the heuristic to identify rows and columns - is yours.
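
Just to show what such a heuristic might look like, here is one possible sketch, not a general solution. It assumes left-aligned columns, so cells in the same column share roughly the same `left`, and it treats cells whose `top` values differ by only a few units as one row. The tolerances, the sample cells and the names `bucket_index` and `build_table` are all invented for the example.

```perl
use strict;
use warnings;

# Hypothetical cells, as they might come out of the XML (values invented).
my @cells = (
    { top => 100, left => 50,  text => 'Name'  },
    { top => 100, left => 200, text => 'Qty'   },
    { top => 121, left => 200, text => '3'     },
    { top => 120, left => 50,  text => 'Apple' },
    { top => 140, left => 51,  text => 'Pear'  },
);

# Walk the values in ascending order; a value joins the current bucket when
# it lies within $tol of the previous value, otherwise it opens a new bucket.
# Returns a hash mapping each value to its bucket index.
sub bucket_index {
    my ( $tol, @values ) = @_;
    my ( %index, $prev );
    my $n = -1;
    for my $v ( sort { $a <=> $b } @values ) {
        $n++ if !defined $prev || $v - $prev > $tol;
        $index{$v} = $n;
        $prev = $v;
    }
    return %index;
}

# Place every cell at [row][column] according to its top/left buckets.
# Two cells landing in the same slot would overwrite each other.
sub build_table {
    my ( $row_tol, $col_tol, @cells ) = @_;
    my %row = bucket_index( $row_tol, map { $_->{top} } @cells );
    my %col = bucket_index( $col_tol, map { $_->{left} } @cells );
    my @table;
    $table[ $row{ $_->{top} } ][ $col{ $_->{left} } ] = $_->{text} for @cells;
    return @table;
}

my @table = build_table( 3, 10, @cells );
print join( "\t", map { $_ // '' } @$_ ), "\n" for @table;
```

Right-aligned or centred columns would need a different key (for example `left + width`), and a long chain of closely spaced values can drift into one bucket. That is exactly the part that depends on your documents.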

We can't do that part for you, because we don't know your exact requirements, and a Perl module can't be more intelligent than you are. ;-)

Good luck!

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

Re^2: How to Extract PDF tables using Perl
by ateague (Monk) on May 11, 2016 at 14:08 UTC

    See my post here for an example that uses the pdftohtml.exe program that LanX is referring to.

    One caveat though: as LanX mentioned in his link, pdftohtml, under certain circumstances, may not break a tabular line up into its individual columns. Unfortunately this sort of thing is really dependent on the internal structure, version, content, and layout of the PDF. The perils of using a display format as data...

      Another point: pdftohtml does not represent the lines drawn for borders, so you have to go by text position only.

      Cheers Rolf