pdf to html

mouleeshmichael has asked for the wisdom of the Perl Monks concerning the following question:

Re: pdf to html by Anonyrnous Monk (Hermit) on Dec 23, 2010 at 10:27 UTC
Almost the same question has been asked recently: Convert PDF file into HTML file. Maybe you'll find some ideas there.
Re: pdf to html by roboticus (Chancellor) on Dec 23, 2010 at 13:38 UTC
mouleeshmichael: <snarky>This bit will convert a text file to HTML for you:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp;
use HTML::Entities;

# Input file name comes from the first command-line argument
my $INFname = shift or die "Missing input file name!";

# Read the whole file and escape characters that are special in HTML
my $text = read_file($INFname);
$text = encode_entities($text);

# Write the escaped text, wrapped in a bare HTML skeleton, to <input>.html
open my $OUTF, '>', $INFname . ".html"
    or die "Can't open $INFname.html: $!\n";
print $OUTF "<html><body>$text</body></html>\n";
close $OUTF or die "Can't close $INFname.html: $!\n";
```

</snarky>To help you further, we would have to know where you are, and where you're going. You've provided so little information that it would be difficult to figure out just what you're having trouble with. It's a pretty trivial task to convert text to HTML, so I rather doubt that the code provided is what you're looking for, but I believe it meets all the requirements you listed.

...roboticus

When your only tool is a hammer, all problems look like your thumb.
Re: pdf to html by CountZero (Bishop) on Dec 23, 2010 at 16:32 UTC
It all depends on how much of the PDF file you want to keep: just the text, or also the whole of the formatting? I do not think there is any easy way to transfer the formatting from a PDF file to an HTML file, unless you transform the PDF file into a picture (JPEG or such) and include that in your HTML.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re: pdf to html by snobol (Initiate) on Dec 23, 2010 at 23:39 UTC
I always used pdftohtml: `http://pdftohtml.sourceforge.net/` Then I parsed the HTML for content with HTML::TreeBuilder::XPath. This works particularly well for simple documents, or documents with a standardized structure. You can look for the x/y offset of the element to find the exact piece of information you're looking for.
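
A minimal sketch of the x/y lookup described above. It assumes the converted HTML contains absolutely positioned elements whose `style` attribute carries `top:` and `left:` values, which is what pdftohtml's complex output mode typically looks like; the sample markup, the coordinates, and the `text_at` helper are made up for illustration, so check them against what your own pdftohtml run actually emits.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample of positioned output; real pdftohtml markup may differ.
my $html = <<'HTML';
<html><body>
<div style="position:absolute;top:107;left:85"><span class="ft0">Invoice No.</span></div>
<div style="position:absolute;top:107;left:240"><span class="ft0">12345</span></div>
<div style="position:absolute;top:130;left:85"><span class="ft0">Date</span></div>
</body></html>
HTML

# Return the text of every element positioned at exactly ($top, $left).
# text_at is an illustrative helper, not part of any module.
sub text_at {
    my ($tree, $top, $left) = @_;
    my @hits;
    for my $node ($tree->findnodes('//*[@style]')) {
        my $style = $node->attr('style');
        # Pull the numeric offsets out of the inline style; skip nodes without them
        my ($t) = $style =~ /(?:^|;)\s*top:\s*(\d+)/  or next;
        my ($l) = $style =~ /(?:^|;)\s*left:\s*(\d+)/ or next;
        push @hits, $node->as_trimmed_text if $t == $top && $l == $left;
    }
    return @hits;
}

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
print join(", ", text_at($tree, 107, 240)), "\n";
$tree->delete;
```

For a real document, `HTML::TreeBuilder::XPath->new_from_file(...)` on the generated page would replace the inline sample. Matching on exact offsets only holds up when the layout is stable from one document to the next, which fits the "standardized structure" case; otherwise a small tolerance around each coordinate is probably safer.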