in reply to Re: Extracting tables from PDF
in thread Extracting tables from PDF

Thanks for the pointer. The project name is actually pdftohtml, spelled with "to" rather than the digit 2, but that was close enough to find it easily.

Unfortunately, this is still pretty ugly... Tables do end up displaying properly in "complex document" mode, but that's just because it puts every element in a <div> and positions it with style=position:absolute. Whether it's in normal mode or complex document mode, there's nary a <table> tag in sight.

I also found a message in one of the project forums where the author tells someone else,

"There is no concept of tables in PDF. When you see a table in a PDF file, it's just a bunch of text positioned in particular places and a bunch of lines. There is no simple way to translate tables from PDF to HTML or anything else."

Granted, the post was from mid-2004, but, unless that's changed, this looks very not-promising.
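
To make the quoted point concrete: if all you have is text fragments with page coordinates, any table structure has to be inferred from those positions. Here is a minimal Python sketch of one such heuristic. It is not something pdftohtml does. It assumes the fragments have already been pulled out by some other tool as (x, y, text) tuples, that y grows upward from the bottom of the page (PDF's default coordinate system), and that cells in the same row share roughly the same baseline. The name `group_into_rows` and the `y_tolerance` parameter are made up for illustration.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows, top of page first.

    Assumption: fragments whose y coordinate is within y_tolerance of the
    first fragment placed in a row belong to that same row.
    """
    rows = []
    # Larger y is higher on the page (assumed bottom-left origin),
    # so sort by descending y to walk the page top to bottom.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the cells left to right by x.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Hypothetical fragments, deliberately out of order and slightly misaligned.
fragments = [
    (50, 700, "Name"), (150, 700.5, "Qty"),
    (150, 680, "3"), (50, 680.4, "Apples"),
    (50, 660, "Pears"), (150, 660, "7"),
]
print(group_into_rows(fragments))
# prints [['Name', 'Qty'], ['Apples', '3'], ['Pears', '7']]
```

This only recovers rows. Deciding which column a cell belongs to, especially with empty cells or wrapped text, takes further guesswork, which is why there is no simple general solution.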

Re^3: Extracting tables from PDF
by Dervish (Friar) on Jul 13, 2007 at 04:39 UTC
    That's exactly right; PDF describes how to position elements on the page, but it doesn't have any built-in concept of a 'table'. As a result, any PDF writer is free to do anything from creating its own table command to individually positioning each character in the table in any order, with arbitrary commands in between each table entry.

    The formatting commands above actually mimic the most common PDF code fairly well.
Re^3: Extracting tables from PDF
by Anonymous Monk on Jul 13, 2007 at 04:11 UTC
    That wouldn't have changed. PDF is not HTML.
      TeX is not HTML. TeX has tables.

      RTF is not HTML. RTF has tables.

      Etc. Neither the concept of tables nor the presence of syntax indicating "this is a table" is unique to HTML. (And I really don't care about HTML here anyhow, just a way to identify the tables in a PDF and suck the data out of them.)