Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, how can I use the PDF::Extract Perl module to save a .pdf file as an .html file? Is that possible with this module? Thank you!



Re: How can I convert a pdf to html with PDF::Extract?
by davido (Cardinal) on Nov 28, 2003 at 08:24 UTC
    There's a great example snippet in the POD for PDF::Extract:

    use PDF::Extract;
    $pdf = new PDF::Extract( PDFDoc=>'C:/my.pdf' );
    $pdf->getPDFExtract( PDFPages=>$PDFPages );
    print "Content-Type text/plain\n\n<xmp>", $pdf->getVars("PDFExtract");
    print $pdf->getVars("PDFError");

    # or
    # Extract and save, in the current directory, all the pages in a pdf document
    use PDF::Extract;
    $pdf = new PDF::Extract( PDFDoc=>"test.pdf" );
    $i = 1;
    $i++ while ( $pdf->savePDFExtract( PDFPages=>$i ) );

    Update to provide more thorough information:

    The POD also discusses the following:

    With PDF::Extract a new PDF document can be:-

    • assigned to a scalar variable with getPDFExtract.
    • saved to disk with savePDFExtract.
    • printed to STDOUT as a PDF web document with servePDFExtract.
    • cached and served for a faster PDF web document service with fastServePDFExtract.

    So I guess the short answer is that yes, this is an appropriate tool for the job. The example under the heading servePDFExtract shows how to output to STDOUT with the correct header for a PDF document served on the web.
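
    For instance, here is a minimal sketch built only from the calls in the POD snippet above (the file name and page number are placeholders); it pulls one page into a scalar and checks the module's error slot rather than saving to disk:

    use PDF::Extract;

    # Extract page 1 of an example document into memory instead of saving it.
    my $pdf = PDF::Extract->new( PDFDoc => 'test.pdf' );
    $pdf->getPDFExtract( PDFPages => 1 );

    my $page  = $pdf->getVars("PDFExtract");   # the extracted page, itself a one-page PDF string
    my $error = $pdf->getVars("PDFError");     # empty unless the extraction failed

    print $error
        ? "Extraction failed: $error\n"
        : "Extracted " . length($page) . " bytes of PDF.\n";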


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
      That's amazing, how did you know to read the documentation?
      Hi, do you mean that changing the .pdf extension to .html in the savePDFExtract call will do the task for me? Can you explain in detail how that can be achieved? Thank you!
        No, I mean to say that the following snippet will output a complete PDF content header, followed by the PDF document that you've extracted, for the webserver to serve up to the HTTP client.

        $pdf = PDF::Extract->new( PDFDoc=>'C:/my.pdf', PDFErrorPage=>"C:/myErrorPage.html" );
        $pdf->servePDFExtract( PDFPages=>1 );


        Dave


        "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
      Hi, when I use servePDFExtract and print to STDOUT, will that get me the document in HTML? When I do that I get output that is neither plain text nor HTML. Can you let me know how I can convert the PDF file to an HTML file, graphics included? Please point me in the right direction with a few samples of code to achieve it. Thank you!
        I understood you as needing to extract and display a PDF page on a web browser, and for that, the PDF::Extract module is on target. HTML::HTMLDoc::PDF would also be appropriate for that use. But for file format conversion, from PDF to HTML, which I'm now understanding is what you're trying to do, that's a different story, and a bit more difficult.

        You could use PDF::Parse to dissect the PDF file. But turning the output of PDF::Parse into HTML is real work, especially considering that some PDF files are encrypted, which renders many of the PDF::Parse functions useless.

        There are programs out there already that do the conversion for you, without requiring you to toil over trying to roll your own converter. this seems to be one possibility. There is another one here. The second one listed here is shareware with a free trial period. And Googling for "pdf to html conversion" turns up droves.

        Of course they're not geared for doing it "on the fly". But it's also not a quick, on-the-fly type of process. Converting dynamically, on demand, would drive server load through the roof. It's something best done once, and for that, why not use an already completed solution?
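
        To make the "do it once" point concrete, here is a rough sketch (not from PDF::Extract, and not necessarily one of the converters linked above) that shells out to pdftohtml, one freely available PDF-to-HTML converter, and only runs it when the HTML is not already there. The file names and output naming are just examples; check your converter's docs for its exact behaviour.

        use strict;
        use warnings;

        my $pdf  = 'manual.pdf';   # example input
        my $base = 'manual';       # pdftohtml derives its output file names from this

        # Convert once; afterwards the webserver just keeps serving the static HTML.
        unless ( -e "$base.html" ) {
            system( 'pdftohtml', $pdf, $base ) == 0
                or die "pdftohtml failed: $?";
        }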


        Dave


        "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
Re: How can I convert a pdf to html with PDF::Extract?
by JamesNC (Chaplain) on Nov 28, 2003 at 18:35 UTC
    You don't want to do this, and I can't see anyone building such a tool that any business would want to spend money on. And if you owned the document you are trying to do this with, there would be an obvious solution. Extracting text in order to search a PDF for content makes sense; converting PDF to HTML does not compute, Will Robinson (Google for Lost In Space if the cliché is foreign to you).

    I have built many PDF and PostScript files by hand and have a module I am planning to release in the near future, so I understand the PDF and PostScript file formats. There are several kinds of objects that store text and images in a PDF file. The operators are not in any way associated with HTML tags, and they can appear in ANY order in the document, because they have placement operators that are sometimes relative and sometimes absolute. You can do almost anything in a PDF that you can do in PostScript. The PDF designer must specify everything (font name, size, weight, rotation, scale, fill, color, line width, pattern, image, placement).

    It is a piece of cake to yank the text and images from a .pdf (well, almost: if the text is in hex or some other encoding, you have to massage that too). The real challenge is determining what should be a Heading1 or a paragraph, and making sure the text comes out in the correct order, which means keeping track of the position on the page and translating relative paths to absolute, which in turn means keeping track of the transformation matrix and more. As a result, it is possible to extract text in a different order than the author intended (e.g. English vs. Korean or Chinese).
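
    To make the ordering problem concrete, here is a toy sketch. The two-line content stream is invented (and real PDFs usually Flate-compress their streams, so you would have to inflate them first); the regex naively grabs every literal string handed to the Tj "show text" operator:

    use strict;
    use warnings;

    # Two text objects, deliberately stored in the opposite order from how they
    # sit on the rendered page (the heading, at y=740, is higher on the page).
    my $stream = join "\n",
        'BT /F1 12 Tf 72 700 Td (Second paragraph of body text.) Tj ET',
        'BT /F1 18 Tf 72 740 Td (A Heading) Tj ET';

    # Pull out every (string) Tj. This ignores escaped parentheses, hex strings,
    # TJ arrays, and all the positioning operators -- which is exactly why the
    # extracted text carries no reading order and no notion of "Heading1".
    while ( $stream =~ /\((.*?)\)\s*Tj/g ) {
        print "$1\n";
    }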

    Look, I could go on for pages about how you MIGHT accomplish some aspects of this (extracting the images, guessing at styles...), but none of it would be 100% accurate. Every one of the many ways there are to design a PDF would have to be accounted for when reverse-engineering the HTML, which raises the whole question of whether such an endeavor is cost effective.

    The other thing that strikes me as obvious is that you shouldn't be doing this, because PDF 1.3 and higher documents have been optimized for viewing on the web: there are PDF viewers for most web browsers, the documents can contain hyperlinks, forms, and even JavaScript, and the PDF viewer application can already search them. If you MUST have an HTML version of the document for an audience that cannot use the PDF plug-in (perhaps disabled or deaf?), then you should use the native application from which it was translated into PDF, such as MS Word or WordPerfect, which have predefined HTML layout templates. If you don't own the document, you shouldn't be doing this anyway without the author's permission, and I am certain that if you have a good reason for needing it in another format, you would benefit from letting the author provide you with their approved versions.

    Nuf said...
    JamesNC

      Thank you, JamesNC, for explaining why it's so hard to do and why most would be better served converting to HTML from the source with the same proggy that creates the PDF.

      Anyways, on a larkish whim I tried this last night:

      $ mkdir ~/public_html/pdf_test
      $ cd ~/public_html/pdf_test
      $ convert ~/PDF/ch10.pdf ch10_%02d.html

      And imagine my surprise when it actually worked! But in truth, it doesn't work that well: it does the obvious PDF -> image -> GIF, and the HTML files are just wrappers that load the GIF images of the pages. But if it's thumbnails of a PDF you want, then with a bit of scaling this will work. Still, it's more of a pain to read a giant GIF than to download the PDF and use Acrobat.

      If you really want HTML/GIF versions of a PDF, then take a look at ImageMagick, which provides lots of conversion options.
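
      If thumbnails are what you're after, the same convert program can be driven from Perl with a scale-down added. The 25% factor and the file names below are only examples:

      use strict;
      use warnings;

      my $pdf = "$ENV{HOME}/PDF/ch10.pdf";   # example input

      # Render each page, shrink it, and write numbered GIF thumbnails
      # (ch10_thumb_00.gif, ch10_thumb_01.gif, ...).
      system( 'convert', $pdf, '-resize', '25%', 'ch10_thumb_%02d.gif' ) == 0
          or die "convert failed: $?";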

Re: How can I convert a pdf to html with PDF::Extract?
by Joost (Canon) on Nov 28, 2003 at 11:45 UTC