in reply to Can I convert a pdf to html with PDF::Extract??

You don't want to do this, and I can't see anyone building such a tool that any business would want to spend money on. And if you owned the document you are trying to do this with, then there is an obvious solution. Extracting text in order to search a PDF for content makes sense; converting PDF to HTML "does not compute, Will Robinson" (Google for Lost in Space if the cliche is foreign to you).

I have built many PDF and PostScript files by hand and have a module I am planning on releasing in the near future, so I understand the PDF and PostScript file formats. There are several objects that store text and images in a PDF file. The operators are not in any way associated with HTML tags, and they can appear in ANY order in the document because they have placement operators that are sometimes relative and sometimes absolute. You can do almost anything you can do in PostScript using these PDF operators. The PDF designer must specify everything (font name, size, weight, rotation, scale, fill, color, line width, pattern, image, placement).

It is a piece of cake to yank the text and images from a .pdf (well, almost: if the text is encoded in hex or another encoding, then you have to massage that too). But the real challenge is trying to determine what should be a Heading1 or a paragraph, and making sure that the text is in the correct order. That would mean keeping track of the position on the page and translating relative paths to absolute, which would mean keeping track of the transformation matrix and more... As a result, it is possible to extract text in a different order than the author intended (e.g. English vs. Korean or Chinese).
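To make the ordering problem concrete, here is a minimal Python sketch, not a real PDF parser. It assumes the content stream has already been tokenized into a list of operations, and it models only three text operators (Tm sets the text matrix, Td moves relative to it, Tj shows a string). The function names and the sample data are made up for illustration, and the sketch ignores the current transformation matrix, fonts, glyph widths and every other operator, so treat it as an illustration of the bookkeeping rather than a working extractor.

```python
from typing import List, Tuple

# Text matrix as the six numbers a b c d e f; (e, f) is the translation part.
Matrix = Tuple[float, float, float, float, float, float]
IDENTITY: Matrix = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def decode_hex_string(s: str) -> str:
    """Decode a hex string such as '<5469746C65>'.

    Assumption: one byte per character, Latin-1. Real PDFs depend on the
    font's encoding, which this sketch does not model.
    """
    digits = "".join(s.strip("<>").split())
    if len(digits) % 2:
        digits += "0"  # a missing final digit is treated as 0
    return bytes.fromhex(digits).decode("latin-1")


def place_text(ops: List[tuple]) -> List[Tuple[float, float, str]]:
    """Walk the operations and record (x, y, text) for every shown string."""
    tm = IDENTITY
    placed = []
    for op, *args in ops:
        if op == "Tm":  # absolute placement: replace the matrix
            tm = tuple(float(v) for v in args)
        elif op == "Td":  # relative placement: translate within the matrix
            tx, ty = args
            a, b, c, d, e, f = tm
            tm = (a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f)
        elif op == "Tj":  # show text at the current position
            text = args[0]
            if text.startswith("<"):
                text = decode_hex_string(text)
            placed.append((tm[4], tm[5], text))
    return placed


def reading_order(placed: List[Tuple[float, float, str]]) -> List[str]:
    """Guess reading order: top to bottom, then left to right.

    Assumption: a horizontal, left-to-right script in a single column.
    """
    return [text for _, _, text in sorted(placed, key=lambda p: (-p[1], p[0]))]


# Hypothetical stream: the body is drawn first, the title afterwards.
SAMPLE_OPS = [
    ("Tm", 1, 0, 0, 1, 72, 700),
    ("Tj", "Body text"),
    ("Td", 0, 20),
    ("Tj", "<5469746C65>"),
    ("Tm", 1, 0, 0, 1, 72, 650),
    ("Tj", "Second paragraph"),
]

if __name__ == "__main__":
    placed = place_text(SAMPLE_OPS)
    print("stream order: ", [text for _, _, text in placed])
    print("reading order:", reading_order(placed))
```

Even in this toy case the title comes out second if you read the stream in file order, and sorting by position only repairs it because the sketch assumes a single left-to-right column. Multiple columns, rotated text or a vertical script would each need a different guess.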

Look, I could go on for pages about how you MIGHT accomplish some aspects of this (extracting the images, guessing styles...), but none of them would be 100% accurate. And every way there is to design a PDF would have to be accounted for when reverse engineering the HTML, which raises the whole question of the cost effectiveness of such an endeavor.

The other thing that strikes me as obvious is that you shouldn't be doing this, because PDF 1.3 and higher docs have been optimized for viewing on the Web: there are PDF viewers for most web browsers, the docs can have hyperlinks in them and even forms and JavaScript capability built into them, and they can already be searched by the PDF viewer app! If you MUST have an HTML version of the document for an audience that cannot use the PDF plug-in (perhaps disabled or deaf?), then you should use the native application from which it was translated into PDF, such as MS Word or WordPerfect, which have predefined HTML layout templates. If you don't own the document, then you should not be doing this anyway without the author's permission, and I am certain that if you have a good reason for needing it in another format, you would benefit by letting the author provide you with their approved versions.

Nuf said...
JamesNC

Re: Re: How can I convert a pdf to html with PDF::Extract?
by zengargoyle (Deacon) on Nov 28, 2003 at 23:14 UTC

    thank you JamesNC for answering why it's so hard to do and why most would be better served converting to HTML from the source with the same proggy that creates the PDF.

    anyways, on a larkish whim i tried this last night:

    $ mkdir ~/public_html/pdf_test
    $ cd ~/public_html/pdf_test
    $ convert ~/PDF/ch10.pdf ch10_%02d.html

    and imagine my surprise when it actually worked! but in truth, it doesn't work that well. it does the obvious PDF -> image -> gif, and the html files are just wrappers to load the gif images of the pages. still, if it's thumbnails of a PDF you want, then with a bit of scaling this will work. but it's more of a pain to read a giant gif than to download the PDF and use Acrobat.

    if you really want HTML/GIF versions of a PDF then take a look at ImageMagick (where the convert command above comes from), which provides lots of conversion options.