Re: How to Extract PDF tables using Perl

The best advice I can give you is to use pdftohtml -xml and to parse the coordinates given in the xml output.

The hard work - the heuristic to identify rows and columns - is yours.

Can't be done by us because we don't know the exact requirements and a Perl module can't be more intelligent than you are. ;-)
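
That said, here is a minimal sketch of what such a heuristic could look like, just to show the idea of working from the coordinates. It assumes the usual shape of the pdftohtml -xml output, i.e. <text> elements carrying top and left attributes. The sample data, the helper names parse_text_items and group_rows, and the tolerance of 3 units are all made up for illustration. The regex-based extraction only keeps the example free of non-core modules and does not decode entities; a real XML parser would be more robust.

```perl
use strict;
use warnings;

# Sample in the shape pdftohtml -xml emits; the values are invented.
my $xml = <<'XML';
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Name</text>
<text top="101" left="200" width="30" height="12" font="0">Qty</text>
<text top="120" left="200" width="10" height="12" font="0">3</text>
<text top="120" left="50" width="45" height="12" font="0"><b>Apple</b></text>
</page>
XML

# Hypothetical helper: pull top/left/content out of each <text> element.
sub parse_text_items {
    my ($xml) = @_;
    my @items;
    while ($xml =~ m{<text\s+([^>]*)>(.*?)</text>}gs) {
        my ($attrs, $content) = ($1, $2);
        my %attr = $attrs =~ /(\w+)="([^"]*)"/g;
        $content =~ s/<[^>]+>//g;    # drop inline markup such as <b> or <i>
        push @items, { top => $attr{top}, left => $attr{left}, text => $content };
    }
    return @items;
}

# Hypothetical helper: an item joins the current row when its top is within
# $tolerance of the first item in that row; otherwise it starts a new row.
sub group_rows {
    my ($tolerance, @items) = @_;
    my @rows;
    for my $item (sort { $a->{top} <=> $b->{top} } @items) {
        if (@rows && $item->{top} - $rows[-1][0]{top} <= $tolerance) {
            push @{ $rows[-1] }, $item;
        }
        else {
            push @rows, [$item];
        }
    }
    # Within each row, order the cells from left to right.
    return map { [ sort { $a->{left} <=> $b->{left} } @$_ ] } @rows;
}

my @rows = group_rows(3, parse_text_items($xml));
print join("\t", map { $_->{text} } @$_), "\n" for @rows;
```

The tolerance is the part you will have to tune per document: too small and slightly misaligned cells end up in separate rows, too large and adjacent rows merge.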

Good luck!

Cheers Rolf
_{(addicted to the Perl Programming Language and ☆☆☆☆ :)

Je suis Charlie!}


Re^2: How to Extract PDF tables using Perl by ateague (Monk) on May 11, 2016 at 14:08 UTC
See my post here for an example that uses the pdftohtml.exe program LanX is referring to. One caveat though: as LanX mentioned in his link, pdftohtml may, under certain circumstances, fail to break a tabular line up into its individual columns. Unfortunately this sort of thing is really dependent on the internal structure, version, content, and layout of the PDF. The perils of using a display format as data...
Re^3: How to Extract PDF tables using Perl by LanX (Saint) on May 11, 2016 at 15:41 UTC
Another point is that border lines are not represented in the pdftohtml output; you have to go by text position only. Cheers Rolf
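
Going by text position only could look roughly like the following sketch, which guesses column boundaries by clustering the left offsets of the text items. The helper names column_starts and column_index, the gap threshold of 20, and the offsets are invented for illustration. It also assumes left-aligned columns; right-aligned ones (numbers, typically) would probably need left plus width instead.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the sorted left offsets and open a new column
# whenever the jump from the previous offset exceeds $gap.
sub column_starts {
    my ($gap, @lefts) = @_;
    my @starts;
    my $prev;
    for my $left (sort { $a <=> $b } @lefts) {
        push @starts, $left if !defined $prev || $left - $prev > $gap;
        $prev = $left;
    }
    return @starts;
}

# Hypothetical helper: index of the last column start at or before $left.
sub column_index {
    my ($left, @starts) = @_;
    my $idx = 0;
    for my $i (0 .. $#starts) {
        $idx = $i if $left >= $starts[$i];
    }
    return $idx;
}

my @lefts  = (50, 52, 200, 203, 51, 350);    # invented left offsets
my @starts = column_starts(20, @lefts);
print "column starts: @starts\n";
print "left=203 falls in column ", column_index(203, @starts), "\n";
```

Collect the left values over the whole table rather than per row, otherwise an empty cell shifts every column index to its right.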