mouleeshmichael has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I want to convert PDF files to HTML. I have already converted the PDF to text using the CAM::PDF module. How can I convert that text to HTML automatically? Please help me further.

Replies are listed 'Best First'.
Re: pdf to html
by Anonyrnous Monk (Hermit) on Dec 23, 2010 at 10:27 UTC
Re: pdf to html
by roboticus (Chancellor) on Dec 23, 2010 at 13:38 UTC

    mouleeshmichael:

    <snarky>This bit will convert a text file to HTML for you:

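    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Slurp;
    use HTML::Entities;

    # Input file name comes from the command line
    my $INFname = shift or die "Missing input file name!";

    # Read the whole file and escape characters that are special in HTML
    my $text = read_file($INFname);
    $text = encode_entities($text);

    # Write the result next to the input file, with ".html" appended
    open my $OUTF, '>', $INFname . ".html"
        or die "Can't open $INFname.html: $!\n";
    print $OUTF "<html><body>$text</body></html>\n";
    close $OUTF or die "Can't close $INFname.html: $!\n";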

    </snarky>To help you further, we would have to know where you are and where you're going. You've provided so little information that it would be difficult to figure out just what you're having trouble with. It's a pretty trivial task to convert text to HTML, so I rather doubt that the code provided is what you're looking for, but I believe that it meets all the requirements you listed.
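
    If the goal is simply to keep the paragraph breaks of the extracted text, a minimal sketch along the following lines might be closer to what is wanted. It is only an illustration: the helper names `escape_html` and `text_to_html` are made up for this example, the escaping is done by hand so that only core Perl is needed, and it assumes paragraphs in the text are separated by blank lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the three characters that are unsafe in HTML body text.
# (Hypothetical helper; HTML::Entities does a more thorough job.)
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Split plain text on blank lines and wrap each paragraph in <p>...</p>.
# Assumption: a blank line marks a paragraph boundary.
sub text_to_html {
    my ($text) = @_;
    my @paras;
    for my $chunk (split /\n\s*\n/, $text) {
        $chunk =~ s/^\s+//;     # trim leading whitespace
        $chunk =~ s/\s+\z//;    # trim trailing whitespace
        push @paras, '<p>' . escape_html($chunk) . '</p>' if length $chunk;
    }
    return "<html><body>\n" . join("\n", @paras) . "\n</body></html>\n";
}

print text_to_html("Fish & chips\n\n1 < 2\n");
```

    Anything beyond paragraphs (headings, lists, tables) cannot be recovered this way, because that information is already gone from the plain text.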

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: pdf to html
by CountZero (Bishop) on Dec 23, 2010 at 16:32 UTC
    It all depends on how much of the PDF file you want to keep: just the text, or the whole of the formatting as well?

    I do not think there is any easy way to transfer the formatting from a PDF file to an HTML file, unless you transform the PDF file into a picture (JPEG or similar) and include that in your HTML.
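
    The picture approach could be sketched roughly as follows. This is an assumption-laden illustration, not a complete converter: it presumes the pages have already been rendered to image files by some external tool (for example `pdftoppm -png` from poppler), the helper name `image_page` and the file names are hypothetical, and the file names are assumed to contain no characters that need escaping in an HTML attribute.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a bare HTML page that shows one image per PDF page.
# Assumption: @images is already in page order.
sub image_page {
    my (@images) = @_;
    my $n = 0;
    my @tags = map { $n++; qq{<p><img src="$_" alt="Page $n"></p>} } @images;
    return "<html><body>\n" . join("\n", @tags) . "\n</body></html>\n";
}

# Hypothetical file names, standing in for the rendered pages.
print image_page('page-1.png', 'page-2.png');
```

    The result looks exactly like the PDF, but the text can no longer be searched, selected, or indexed.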

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: pdf to html
by snobol (Initiate) on Dec 23, 2010 at 23:39 UTC

    I always used pdftohtml:

    http://pdftohtml.sourceforge.net/

    Then I parsed the HTML for content with HTML::TreeBuilder::XPath. This works particularly well for simple documents, or documents with a standardized structure. You can look for the x/y offset of the element to find the exact piece of information you're looking for.
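
    A rough sketch of the x/y offset idea, assuming the generated HTML places each piece of text in an element whose style attribute carries `position:absolute` together with `top` and `left` values (the exact markup differs between pdftohtml versions and options, so check your own output first). The sample HTML, the label "Total:", and the helper names `positioned_text` and `value_right_of` are invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Collect { top, left, text } for every absolutely positioned element.
sub positioned_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    my @found;
    for my $node ($tree->findnodes('//*[contains(@style, "position:absolute")]')) {
        my $style = $node->attr('style');
        my ($top)  = $style =~ /(?:^|;)\s*top:\s*(\d+)/;
        my ($left) = $style =~ /(?:^|;)\s*left:\s*(\d+)/;
        next unless defined $top && defined $left;
        push @found, { top => $top, left => $left, text => $node->as_text };
    }
    $tree->delete;    # free the parse tree
    return @found;
}

# Return the text of the nearest element to the right of $label
# on the same line (same "top" offset), or undef if there is none.
sub value_right_of {
    my ($label, @items) = @_;
    my ($anchor) = grep { $_->{text} eq $label } @items;
    return undef unless $anchor;
    my ($value) = sort { $a->{left} <=> $b->{left} }
                  grep { $_->{top} == $anchor->{top} && $_->{left} > $anchor->{left} }
                  @items;
    return $value ? $value->{text} : undef;
}

# Invented sample, standing in for a page produced by pdftohtml.
my $sample = <<'HTML';
<html><body>
<div style="position:absolute;top:100;left:50">Invoice</div>
<div style="position:absolute;top:140;left:50">Total:</div>
<div style="position:absolute;top:140;left:300">42.00</div>
</body></html>
HTML

my $total = value_right_of('Total:', positioned_text($sample));
print defined $total ? "$total\n" : "not found\n";
```

    Matching on an exact `top` value is fragile for real documents; allowing a tolerance of a few pixels is usually more robust.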