HTML input to PDF output

hacker has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a news "portal" site for one of my public Open Source projects, and have a question about PDF output.

Basically I have articles stored in a MySQL database which have some basic HTML formatting elements built into their body. These are not "webpages" per se, since there are no opening or closing <head> or <body> tags, just simple paragraph, bold, list, and formatting elements. Here's a small example (note, this is the entire article as queried from the database):

<p class="fol">Here's some text that goes in the body
of the article. It has some list items like this:</p> 
<ul>
<li>List item one</li>
<li>List item two</li>
</ul>

When the user selects one of these articles to read, the article is put between the "slices of bread" which add my header and footer elements, according to the layout of the site, and the article body goes in the middle.

What I'd like to do is provide a link at the bottom of each article that says "Convert to PDF" and have it link to a sub that can stuff this article into a PDF and present it to the user for view/download.

I have all of the mechanical bits of the query, response, display, etc. working, but need to know whether any of the various PDF modules (Text::PDF, PDF::Create, Data::Dbf2pdf, PDFLib) can take this article input and convert it into a usable PDF file, retaining the formatting that the HTML provides.

All of the examples I've seen deal with only plain text, not images or HTML-formatted text. I'm not opposed to doing some s/<li>/ o/ stuff where required, but I'd like to eliminate the need to do that kind of barbaric conversion.

edited: Wed Jul 24 16:30:15 2002 by jeffa - fixed closing li tag typo per author's request


Replies are listed 'Best First'.
(wil) Re: HTML input to PDF output by wil (Priest) on Jul 24, 2002 at 15:41 UTC
Have you tried HTMLDOC? It converts HTML documents into PDF or PostScript documents, preserving as much formatting as possible. - wil
Re: (wil) Re: HTML input to PDF output by Hero Zzyzzx (Curate) on Jul 24, 2002 at 16:05 UTC
Big ups to HTMLDOC. While not a pure-Perl solution, it does a VERY nice job rendering HTML into PDF. It'll even keep graphics for you. Make a specialized header/footer for the pages that get converted by HTMLDOC, to strip out all the extraneous graphic foofery. Try HTMLDOC. I guarantee you'll like the results, and it's open source. I use it to create PDFs on the fly in a CMS project of mine, and I'm very impressed with its speed and the quality of the output. The author is very responsive to fixes/suggestions too. -Any sufficiently advanced technology is indistinguishable from doubletalk.
Re: HTML input to PDF output by hacker (Priest) on Jul 24, 2002 at 16:16 UTC
I would prefer not to have to run YASC (Yet Another System Command) from this particular script. I would prefer a Perl-only solution, but this may get me by in the short term, until such a solution exists. Good tip.
Re: Re: HTML input to PDF output by Hero Zzyzzx (Curate) on Jul 24, 2002 at 16:29 UTC
Don't create them on the fly then. Pre-create them with a cron job, or only when an item is updated. Seriously, try out HTMLDOC. It gives very good output, and will save you huge amounts of time trying to get this to work in Perl. -Any sufficiently advanced technology is indistinguishable from doubletalk.
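For anyone wanting to try the "only when an item is updated" approach, here is a minimal sketch, not a tested production setup. It assumes the `htmldoc` binary is on the PATH and that a non-zero exit status means failure. The sub names (`wrap_article`, `pdf_is_stale`, `htmldoc_command`, `build_pdf`) and the idea of passing the article's update time as an epoch timestamp are hypothetical, made up for illustration. The title is inserted unescaped, so it is assumed to be trusted text.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Wrap a bare article fragment in a minimal page so the converter
# sees a complete HTML document. $title is assumed to be trusted text.
sub wrap_article {
    my ($title, $body) = @_;
    return "<html><head><title>$title</title></head>\n"
         . "<body>\n$body\n</body></html>\n";
}

# True when the PDF is missing or older than the article's last update.
# $updated is an epoch timestamp (assumed to come from the database row).
sub pdf_is_stale {
    my ($pdf, $updated) = @_;
    return 1 unless -e $pdf;
    my $mtime = (stat $pdf)[9];
    return $mtime < $updated ? 1 : 0;
}

# Build the command as a list so no shell is involved.
sub htmldoc_command {
    my ($html_file, $pdf_file) = @_;
    return ('htmldoc', '--quiet', '--webpage', '-t', 'pdf',
            '-f', $pdf_file, $html_file);
}

# Regenerate the PDF only when needed. Returns 1 if it was rebuilt,
# 0 if the existing file was still current. Dies if htmldoc fails.
sub build_pdf {
    my ($title, $body, $updated, $pdf_file) = @_;
    return 0 unless pdf_is_stale($pdf_file, $updated);
    my ($fh, $html_file) = tempfile(SUFFIX => '.html', UNLINK => 1);
    print {$fh} wrap_article($title, $body);
    close $fh or die "close $html_file: $!";
    system(htmldoc_command($html_file, $pdf_file)) == 0
        or die "htmldoc failed: $?";
    return 1;
}
```

Calling `build_pdf` from the code that saves an article (or from a cron job looping over articles) keeps the system command out of the request path, so the "Convert to PDF" link only has to serve an existing file.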
(jeffa) Re: HTML input to PDF output by jeffa (Bishop) on Jul 24, 2002 at 15:50 UTC
A limited solution (no table or frame support) is the Cookbook's Recipe 20.5:

use strict;
use HTML::FormatText;
use HTML::Parse;

# Slurp the article fragment from the DATA section below.
my $data = do { local $/; <DATA> };
my $html = parse_html($data);
my $formatter = HTML::FormatText->new(
    leftmargin  => 0,
    rightmargin => 50,
);
my $ascii = $formatter->format($html);
print "$ascii\n";

__DATA__
<p class="fol">Here's some text that goes in the body
of the article. It has some list items like this:</p>
<ul>
<li>List item one</li>
<li>List item two</li>
</ul>

This generates the following output:

Here's some text that goes in the body of the article. It has some list items like this:
* List item one
* List item two

I have found that converting HTML to text is hard, and the best free tool I have found so far is `lynx -dump`. Of course, the best solution is to never mix presentation with data! :)

Update: ~~in case you are wondering where that extra bullet came from, it is the result of the closing li tags. Looks like HTML::FormatText could use an upgrade to support XHTML.~~ -- good catch Hero Zzyzzx! ;) I fixed this typo since hacker requested I fix the original. For historical purposes, the first list item looked like so: `<li>List item one<li>`.

jeffa
Remember kids, just say no to mixing data and presentation!
Re: (jeffa) Re: HTML input to PDF output by Hero Zzyzzx (Curate) on Jul 24, 2002 at 16:26 UTC
Not to niggle, but one of the closing li tags isn't really a closing tag: `<li>List item one<li>`. Note the second li. -Any sufficiently advanced technology is indistinguishable from doubletalk.
Re: HTML input to PDF output by jsegal (Friar) on Jul 24, 2002 at 16:02 UTC
You could also try html2ps and then run the output through ps2pdf (which comes from the Ghostscript distribution). Note -- I have used the latter (ps2pdf) extensively, and it works great. I have never tried html2ps, but it is written in Perl and is GPLed, which gives it a ++ in my book! --JAS
Re: HTML input to PDF output by Stegalex (Chaplain) on Jul 24, 2002 at 21:43 UTC
What I do is run my HTML through a utility called `html2ps` and then I run the resulting PostScript through `ps2pdf`. It's slow, so I have to agree with the others who advise you to avoid doing this "on-the-fly". ~~~~~~~~~~~~~~~ I like chicken.
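The two-step route described above could be driven from Perl roughly like this. It is only a sketch: it assumes `html2ps` and `ps2pdf` are both installed and on the PATH, and the sub names (`pipeline_commands`, `html_to_pdf`) and the choice of `$pdf_file.ps` as the intermediate file name are hypothetical.

```perl
use strict;
use warnings;

# Commands for the two-step conversion, built as lists so that
# file names never pass through a shell.
sub pipeline_commands {
    my ($html_file, $ps_file, $pdf_file) = @_;
    return (
        [ 'html2ps', '-o', $ps_file, $html_file ],   # HTML -> PostScript
        [ 'ps2pdf', $ps_file, $pdf_file ],           # PostScript -> PDF
    );
}

# Run both steps in order, dying on the first failure.
sub html_to_pdf {
    my ($html_file, $pdf_file) = @_;
    my $ps_file = "$pdf_file.ps";    # intermediate file, removed on success
    for my $cmd (pipeline_commands($html_file, $ps_file, $pdf_file)) {
        system(@$cmd) == 0 or die "$cmd->[0] failed: $?";
    }
    unlink $ps_file;
    return $pdf_file;
}
```

Since this pipeline is reported to be slow, it is probably better suited to a cron job or an on-update hook than to a CGI request.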
Re: HTML input to PDF output by nmerriweather (Friar) on Jul 25, 2002 at 01:17 UTC
Isn't there a module/program called html2pdf or something like that?
Re: Re: HTML input to PDF output by NaSe77 (Monk) on Jul 25, 2002 at 07:17 UTC
Yes, there is, on the html2pdf Homepage ... but unfortunately it doesn't support many tags, and `list` is missing. ---- NaSe :x
Re: HTML input to PDF output by yogivan (Acolyte) on Jul 25, 2002 at 08:13 UTC
HTMLDOC is certainly a very good solution. It's fast, and it really does a good job of converting HTML to PDF. -yogivan

