PerlMonks  

HTML input to PDF output

by hacker (Priest)
on Jul 24, 2002 at 15:35 UTC ( [id://184895]=perlquestion )

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a news "portal" site for one of my public Open Source projects, and have a question about PDF output.

Basically I have articles stored in a MySQL database which have some basic HTML formatting elements built into their body. These are not "webpages" per se, since there are no opening or closing <head> or <body> tags, just simple paragraph, bold, list, and formatting elements. Here's a small example (note, this is the entire article as queried from the database):

<p class="fol">Here's some text that goes in the body of the article. It has some list items like this:</p>
<ul>
  <li>List item one</li>
  <li>List item two</li>
</ul>

When the user selects one of these articles to read, the article is put between the "slices of bread" that add my header and footer elements according to the layout of the site, and the article body goes in the middle.

What I'd like to do is provide a link at the bottom of each article that says "Convert to PDF" and have it link to a sub that can stuff this article into a PDF and present it to the user for view/download.

I have all of the mechanical bits of the query, response, display, etc. working, but need to know whether any of the various PDF modules (Text::PDF, PDF::Create, Data::Dbf2pdf, PDFLib-style modules) can take this article input and convert it into a usable PDF file, retaining the formatting that the HTML provides.

All of the examples I've seen deal with only plain text, not images or HTML-formatted text. I'm not opposed to doing some s/<li>/   o/ stuff where required, but I'd like to eliminate the need to do that kind of barbaric conversion.
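For reference, the kind of crude substitution pass mentioned above might look like the following. This is only a sketch of the fallback, not a recommendation: the sub name crude_html_to_text is made up for illustration, and it handles nothing beyond paragraphs and list items.

```perl
use strict;
use warnings;

# Crude HTML-to-text pass: turn list items into " o " bullets,
# end paragraphs with a newline, then drop every remaining tag.
# (Hypothetical helper; entities, tables and nesting are ignored.)
sub crude_html_to_text {
    my ($html) = @_;
    my $text = $html;
    $text =~ s{<li[^>]*>}{\n   o }gi;   # opening <li> becomes a bullet
    $text =~ s{</p>}{\n}gi;             # paragraph end becomes a line break
    $text =~ s{<[^>]+>}{}g;             # strip whatever tags are left
    return $text;
}
```

Calling print crude_html_to_text($article_body) would give plain text that a text-only PDF module could accept, at the cost of losing all other formatting.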

edited: Wed Jul 24 16:30:15 2002 by jeffa - fixed closing li tag typo per author's request

Replies are listed 'Best First'.
(wil) Re: HTML input to PDF output
by wil (Priest) on Jul 24, 2002 at 15:41 UTC
    Have you tried HTMLdoc? This converts HTML documents into PDF or Postscript documents preserving as much formatting as possible.

    - wil

      Big ups to HTMLDOC. While not a pure perl solution, it does a VERY nice job rendering HTML into PDF. It'll even keep graphics for you.

      Make a specialized header/footer for the pages that get converted by HTMLDOC, to strip out all the extraneous graphic foofery.

      Try HTMLDOC. I guarantee you'll like the results, and it's open source. I use it to create PDFs on the fly in a CMS project of mine, and I'm very impressed with its speed and the quality of the output. The author is very responsive to fixes/suggestions too.
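A minimal sketch of how this could be driven from Perl, assuming the htmldoc binary is on the PATH. The sub names, the default title and the file arguments are hypothetical; the htmldoc options used are --webpage, -t pdf and -f. Because the articles are bare fragments, the sketch wraps them in a minimal page first.

```perl
use strict;
use warnings;

# Wrap a bare article fragment in a minimal HTML document so the
# converter sees a complete page. $title is assumed to be plain text.
sub wrap_fragment {
    my ($fragment, $title) = @_;
    $title = 'Article' unless defined $title;
    return "<html><head><title>$title</title></head>\n"
         . "<body>\n$fragment\n</body></html>\n";
}

# Build the htmldoc argument list: --webpage treats the input as an
# unstructured page, -t pdf selects PDF output, -f names the output file.
sub htmldoc_command {
    my ($html_file, $pdf_file) = @_;
    return ('htmldoc', '--webpage', '-t', 'pdf', '-f', $pdf_file, $html_file);
}

# Write the wrapped article to $html_file and convert it to $pdf_file.
sub article_to_pdf {
    my ($fragment, $html_file, $pdf_file) = @_;
    open my $fh, '>', $html_file or die "Cannot write $html_file: $!";
    print {$fh} wrap_fragment($fragment);
    close $fh or die "Cannot close $html_file: $!";
    # List-form system() avoids passing anything through a shell.
    system(htmldoc_command($html_file, $pdf_file)) == 0
        or die "htmldoc failed with status $?";
    return $pdf_file;
}
```

Something like article_to_pdf($body, "/tmp/article.html", "/tmp/article.pdf") would then produce the file to hand back to the browser.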

      -Any sufficiently advanced technology is
      indistinguishable from doubletalk.

      I would prefer not to have to run YASC (Yet Another System Command) from this particular script. I would prefer a perl-only solution, but this may get me by in the short term, until such a solution exists. Good tip.

        Don't create them on the fly then. Pre-create them with a cron job, or only when an item is updated.

        Seriously, try out HTMLDoc. It gives very good output, and will save you huge amounts of time trying to get this to work in perl.
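The "only when an item is updated" idea can be sketched as a staleness check run from the cron job or the update handler. The names pdf_is_stale and refresh_pdf are hypothetical, $article_updated is assumed to be the article's last-modified time in epoch seconds (for example from a timestamp column), and $build is whatever code actually produces the PDF.

```perl
use strict;
use warnings;

# True if the cached PDF is missing or older than the article's
# last update time (given in epoch seconds).
sub pdf_is_stale {
    my ($pdf_file, $article_updated) = @_;
    return 1 unless -e $pdf_file;
    my $pdf_mtime = (stat $pdf_file)[9];
    return $pdf_mtime < $article_updated ? 1 : 0;
}

# Rebuild the PDF only when needed. $build is a code ref that takes
# the output path and writes the PDF there. Returns 1 if it rebuilt.
sub refresh_pdf {
    my ($pdf_file, $article_updated, $build) = @_;
    return 0 unless pdf_is_stale($pdf_file, $article_updated);
    $build->($pdf_file);
    return 1;
}
```

The "Convert to PDF" link can then point straight at the cached file, so no external command runs while a reader is waiting.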

        -Any sufficiently advanced technology is
        indistinguishable from doubletalk.

(jeffa) Re: HTML input to PDF output
by jeffa (Bishop) on Jul 24, 2002 at 15:50 UTC
    A limited solution (no table or frame support) is the Cookbook's Recipe 20.5:
    use strict;
    use HTML::FormatText;
    use HTML::Parse;

    my $data = do { local $/; <DATA> };
    my $html = parse_html($data);

    my $formatter = HTML::FormatText->new(
        leftmargin  => 0,
        rightmargin => 50,
    );

    my $ascii = $formatter->format($html);
    print "$ascii\n";

    __DATA__
    <p class="fol">Here's some text that goes in the body of the article. It has some list items like this:</p>
    <ul>
      <li>List item one</li>
      <li>List item two</li>
    </ul>
    This generates the following output:
    Here's some text that goes in the body of the
    article. It has some list items like this:
    
      * List item one
    
      * List item two
    
    
    I have found that converting HTML to text is hard, and the best free tool I have found so far is lynx -dump. Of course, the best solution is to never mix presentation with data! :)

    Update: in case you are wondering where that extra bullet came from, it is the result of the closing li tags. Looks like HTML::FormatText could use an upgrade to support XHTML. -- good catch Hero Zzyzzx! ;) I fixed this typo since hacker requested I fix the original. For historical purposes, the first list item looked like so: <li>List item one<li>.

    jeffa

    Remember kids, just say no to mixing data and presentation!

      Not to niggle, but one of the closing li tags isn't really a closing tag-
      <li>List item one<li>
      Note the second li.

      -Any sufficiently advanced technology is
      indistinguishable from doubletalk.

Re: HTML input to PDF output
by jsegal (Friar) on Jul 24, 2002 at 16:02 UTC
    You could also try html2ps and then run the output through ps2pdf (which comes with the Ghostscript distribution).
    Note -- I have used the latter (ps2pdf) extensively, and it works great. I have never tried html2ps, but it is written in Perl and is GPLed, which gives it a ++ in my book!
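A rough sketch of that two-step pipeline from Perl, assuming both html2ps and ps2pdf are installed and on the PATH. The sub names are hypothetical, and it is an untested assumption that html2ps copes with a bare fragment; wrapping the article in a minimal page first may be safer.

```perl
use strict;
use warnings;

# html2ps: -o names the PostScript output file.
sub html2ps_command {
    my ($html_file, $ps_file) = @_;
    return ('html2ps', '-o', $ps_file, $html_file);
}

# ps2pdf takes the input and output files as plain arguments.
sub ps2pdf_command {
    my ($ps_file, $pdf_file) = @_;
    return ('ps2pdf', $ps_file, $pdf_file);
}

# Run both steps, keeping the intermediate .ps next to the .pdf.
sub html_to_pdf_via_ps {
    my ($html_file, $pdf_file) = @_;
    (my $ps_file = $pdf_file) =~ s/\.pdf\z/.ps/
        or die "Expected a .pdf output name, got $pdf_file";
    for my $cmd ([ html2ps_command($html_file, $ps_file) ],
                 [ ps2pdf_command($ps_file, $pdf_file) ]) {
        system(@$cmd) == 0 or die "$cmd->[0] failed with status $?";
    }
    return $pdf_file;
}
```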

    --JAS
Re: HTML input to PDF output
by Stegalex (Chaplain) on Jul 24, 2002 at 21:43 UTC
    What I do is run my html through a utility called html2ps and then I run the resulting postscript through ps2pdf.
    It's slow so I have to agree with the others who advise you to avoid doing this "on-the-fly".

    ~~~~~~~~~~~~~~~
    I like chicken.
Re: HTML input to PDF output
by nmerriweather (Friar) on Jul 25, 2002 at 01:17 UTC
    Isn't there a module/program called html2pdf or something like that?
      Yes, there is, on the html2pdf homepage ...

      Unfortunately it doesn't support many tags, and lists are among the missing ones.

      ----
      NaSe
      :x

Re: HTML input to PDF output
by yogivan (Acolyte) on Jul 25, 2002 at 08:13 UTC
    HTMLDOC is certainly a very good solution. It's fast, and it really does a good job in converting html to pdf. -yogivan

Node Type: perlquestion [id://184895]
Front-paged by TStanley