You don't want to do this, and I can't see anyone building such a tool that any business would want to spend money on. And if you owned the document you are trying to do this with, there is an obvious solution. Extracting text in order to search a PDF for content makes sense; converting PDF to HTML "does not compute, Will Robinson" (Google for Lost In Space if the cliche is foreign to you).

I have built many PDF and PostScript files by hand and have a module I am planning on releasing in the near future, so I understand the PDF and PostScript file formats. There are several objects that store text and images in a PDF file. The operators are not in any way associated with HTML tags, and they can appear in ANY order in the document because they have placement operators that are sometimes relative and sometimes absolute. You can do almost anything you can do in PostScript using these PDF operators. The PDF designer must specify everything (font name, size, weight, rotation, scale, fill, color, line width, pattern, image, placement).

It is a piece of cake to yank the text and images from a .pdf (well, almost: if the text is encoded in hex or another encoding, then you have to do the massaging of that too). But the real challenge is trying to determine what should be a Heading1 or a paragraph, and making sure that the text is in the correct order. That means keeping track of the position on the page and translating relative positions to absolute ones, which in turn means keeping track of the transformation matrix and more. As a result, it is possible to extract text in a different order than the author intended (e.g. English vs. Korean or Chinese).
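To make the ordering problem concrete, here is a minimal sketch (in Python, just for illustration) of following the placement operators in a content stream. It is not a PDF parser: it assumes an already decompressed stream, handles only `BT`, `Tm`, `Td` and `Tj`, treats hex strings as single-byte Latin-1, ignores fonts and glyph widths, and sorts top to bottom, left to right, which is itself an assumption about the script. The names `extract_runs`, `reading_order` and `SAMPLE` are mine.

```python
import re

# Literal strings (no nested parens), hex strings, or any other bare token.
TOKEN = re.compile(r"\((?:[^()\\]|\\.)*\)|<[0-9A-Fa-f\s]*>|[^\s()<>]+")
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)$")

def is_operand(tok):
    """Strings, hex strings, names (/F1) and numbers are operands."""
    return tok[0] in "(</" or NUMBER.match(tok) is not None

def decode_string(tok):
    """Decode a literal (...) or hex <...> string operand (Latin-1 assumed)."""
    if tok.startswith("<"):
        digits = re.sub(r"\s+", "", tok[1:-1])
        if len(digits) % 2:
            digits += "0"  # an odd-length hex string is padded with a 0
        return bytes.fromhex(digits).decode("latin-1")
    return re.sub(r"\\([()\\])", r"\1", tok[1:-1])

def extract_runs(stream):
    """Return (x, y, text) for each Tj, resolving relative moves to absolute."""
    m = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # text line matrix a b c d e f
    operands, runs = [], []
    for tok in TOKEN.findall(stream):
        if is_operand(tok):
            operands.append(tok)
            continue
        if tok == "BT":      # begin text object: matrix resets to identity
            m = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
        elif tok == "Tm":    # absolute placement: replace the matrix
            m = [float(v) for v in operands[-6:]]
        elif tok == "Td":    # relative placement: translate within the matrix
            tx, ty = (float(v) for v in operands[-2:])
            a, b, c, d, e, f = m
            m = [a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f]
        elif tok == "Tj":    # show text at the current origin
            runs.append((m[4], m[5], decode_string(operands[-1])))
        operands = []        # any operator consumes its operands
    return runs

def reading_order(runs):
    """Sort top-to-bottom, then left-to-right (PDF y grows upward)."""
    return [text for _, _, text in sorted(runs, key=lambda r: (-r[1], r[0]))]

# The lower line is written first, and one string is hex encoded.
SAMPLE = """
BT /F1 12 Tf 1 0 0 1 72 700 Tm (second line) Tj ET
BT /F1 12 Tf 1 0 0 1 72 720 Tm (Hello, ) Tj 40 0 Td <776F726C64> Tj ET
"""

print(reading_order(extract_runs(SAMPLE)))
```

Even this toy has to carry a matrix around just to learn where each string landed, and the final sort bakes in a left-to-right, top-to-bottom guess that is wrong for vertical text or multi-column layouts.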

Look, I could go on for pages about how you MIGHT accomplish some aspects of this (extracting the images, guessing styles...), but none of them would be 100% accurate. Every one of the many ways there are to design a PDF would have to be accounted for when reverse engineering the HTML, which raises the whole question of the cost effectiveness of such an endeavor.
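For instance, "guessing styles" usually boils down to a heuristic like the one sketched below (Python again, purely illustrative). It assumes you have already recovered a font size for each run of text, calls the most common size body text, and treats anything larger as a heading. The function name `guess_roles` and the 1.5x threshold are made up for the example, not taken from any tool.

```python
from collections import Counter

def guess_roles(runs, h1_ratio=1.5):
    """Guess an HTML-ish role for each (font_size, text) run.

    Assumes the most frequent font size is body text. Anything at least
    h1_ratio times larger is called h1, anything merely larger is h2.
    """
    body_size = Counter(size for size, _ in runs).most_common(1)[0][0]
    roles = []
    for size, text in runs:
        if size >= h1_ratio * body_size:
            role = "h1"
        elif size > body_size:
            role = "h2"
        else:
            role = "p"
        roles.append((role, text))
    return roles

# Hypothetical runs, as if already pulled out of a page.
runs = [(24, "Title"), (12, "First paragraph."), (12, "More text."),
        (14, "A subsection"), (12, "Even more text.")]
for role, text in guess_roles(runs):
    print(role, text)
```

A pull quote, a drop cap, or a designer who makes headings bold instead of bigger defeats it immediately, which is the point: it is a guess, not a conversion.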

The other thing that strikes me as obvious is that you shouldn't be doing this because PDF 1.3 and higher docs have been optimized for viewing on the Web. There are PDF viewers for most web browsers, and the documents can have hyperlinks, forms and even JavaScript built into them, and can already be searched by the PDF viewer app! If you MUST have an HTML version of the document for an audience that cannot use the PDF plug-in (perhaps users with disabilities?), then you should use the native application from which it was translated into PDF, such as MS Word or WordPerfect, which have predefined HTML layout templates. If you don't own the document, then you should not be doing this anyway without the author's permission, and I am certain that if you have a good reason for needing it in another format, you would benefit by letting the author provide you with their approved versions.

Nuf said...
JamesNC

In reply to Re: How can I convert a pdf to html with PDF::Extract? by JamesNC
in thread Can I convert a pdf to html with PDF::Extract?? by Anonymous Monk
