justinNEE has asked for the wisdom of the Perl Monks concerning the following question:

I have seen various solutions for converting HTML documents to PostScript or PDF, but I have a little twist to throw into the problem: Unicode characters in the HTML page.

With HTML::FormatPS, Unicode characters show up as [NOT SHOWN]; with the program html2ps they show up as garbage. Is this a limitation of html2ps or of the font?

Is there a way to convert HTML with Unicode characters into readable PDF with Perl, so that I can do some additional processing?

Re: HTML to PDF with Unicode
by graff (Chancellor) on Nov 08, 2002 at 03:56 UTC
    This is a shot in the dark -- I'm ignorant about PDF -- but I noticed that one of the character encodings supported by the Perl 5.8 Encode module is called "AdobeStandardEncoding". You might try converting the HTML/Unicode data to this encoding and see what happens... (The man page for the Encode module will explain ways to do the conversion.)
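    A minimal sketch of that conversion, assuming the input bytes are UTF-8 and using Encode's escape-style fallback for anything AdobeStandardEncoding can't represent (the input source here is a placeholder):

        use Encode qw(encode decode);

        my $raw_bytes = do { local $/; <STDIN> };          # placeholder input source
        my $text      = decode('UTF-8', $raw_bytes);       # bytes -> Perl characters
        my $ps        = encode('AdobeStandardEncoding',    # characters -> Adobe bytes
                               $text, Encode::FB_PERLQQ);  # escape what won't map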

    (I'll be glad to update this as soon as someone more knowledgeable weighs in -- there's a good chance I'm way off.) Also, maybe you can figure out which Unicode characters are showing up, and create acceptable substitutes for those.

Re: HTML to PDF with Unicode
by zaimoni (Beadle) on Nov 08, 2002 at 04:37 UTC

    Interesting. What's definite is that HTML::FormatPS is recognizing the Unicode, and is balking at crushing a character set of potentially 65,536+ characters down to a suite of character sets, each with somewhat fewer than 256 characters.

    While I haven't worked on this exact problem, I have been considering how to support Unicode in an HTML-based lexer. I would fall back on a series of heuristics for preprocessing the HTML into something that either HTML::FormatPS or html2ps can deal with. The following is relevant only if no dedicated modules show up; where relevant Perl modules exist, use them in preference to de novo implementations.

    First, become familiar with the Unicode charts at http://www.unicode.org/charts/. You need to understand whether it is possible to reduce the Unicode in your documents to single-byte fonts, and if so, how.

    Filter #0: deploy and use the Unicode::Normalize, Unicode::UCD, and Encode::Unicode modules when manipulating the Unicode in the HTML. There's no sense in reinventing the wheel, and it will simplify things later to know you're working with a given standard form. I would look at the standard forms that use canonical composition (NFC and NFKC). In particular: if the data decodes cleanly as UTF-8, that step is done; otherwise, work with whichever of the NFC or NFKC forms (Unicode::Normalize) proves more relevant.
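    For instance, a minimal sketch of Filter #0, assuming $html already holds decoded character data (the variable name is illustrative):

        use Unicode::Normalize qw(NFC NFKC);

        my $canonical = NFC($html);     # canonical composition
        my $compat    = NFKC($html);    # compatibility composition, folds e.g. ligatures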

    Filter #1: You need to be aware of which additional single-byte fonts also contain the Unicode characters you want. If I were trying to render mathematical literature on a Windows system, I would recommend the Symbol font, a font specifically for Greek, a font specifically for Hebrew, and a font specifically for Russian. This covers just about everything an undergraduate or Master's math student would use, except the special notation for the set of integers and the 'not-congruent' symbol from number theory. You will have to map these Unicode characters to their corresponding representations in <font> tags; those should be more digestible to both HTML::FormatPS and html2ps.
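    As a hedged sketch of Filter #1, here is one way such a mapping might look. The %symbol_map entries and their Symbol-font byte values are illustrative assumptions, not a complete table:

        # Map selected Unicode characters to single-byte Symbol-font
        # equivalents wrapped in <font> tags.
        my %symbol_map = (
            "\x{03B1}" => 'a',    # GREEK SMALL LETTER ALPHA -> Symbol 'a'
            "\x{03B2}" => 'b',    # GREEK SMALL LETTER BETA  -> Symbol 'b'
            "\x{2200}" => '"',    # FOR ALL -> Symbol 0x22
        );
        my $alt = join '|', map { quotemeta } keys %symbol_map;
        $html =~ s{($alt)}{<font face="Symbol">$symbol_map{$1}</font>}g;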

    I regret not being able to recommend a quick solution. Remember, you only need something that handles your documents.

Re: HTML to PDF with Unicode
by Willard B. Trophy (Hermit) on Nov 08, 2002 at 14:34 UTC
    Adobe provides the Adobe Glyph List here: glyphlist.txt. An explanation of how to use it is here: Adobe Solutions Network: Unicode and Glyph Names.

    As it stands, it's really designed for mapping Adobe glyph names to Unicode, but it's just a two-column mapping. We know how to deal with those …
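    If you want to work from that file directly, a minimal sketch of parsing it, assuming the usual "glyphname;XXXX" semicolon-separated format with '#' comment lines:

        # Build a Unicode-character -> glyph-name map from glyphlist.txt.
        open my $fh, '<', 'glyphlist.txt' or die "glyphlist.txt: $!";
        my %glyph_for;
        while (<$fh>) {
            next if /^#/;                     # skip comment lines
            chomp;
            my ($name, $codes) = split /;/;
            next if $codes =~ /\s/;           # skip multi-codepoint entries
            $glyph_for{ chr hex $codes } = $name;
        }
        close $fh;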

    As has already been sagely suggested, don't try to go for a general solution; just implement the glyphs you need now. The "Classic 35" LaserWriter fonts that you can pretty much guarantee are built into any PS engine (and Ghostscript) these days are rather limiting.

    I'm assuming that your PS application supports glyph names (i.e. /M glyphshow instead of, or as well as, (M) show). Many of the specialised characters don't have encoding numbers, so they can't be mapped using traditional integer techniques.
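    Generating that kind of PostScript from Perl is straightforward; a small sketch (ps_glyphshow is a hypothetical helper, and the glyph name comes from the Adobe Glyph List):

        # Emit PostScript that draws a glyph by name, bypassing encoding slots.
        sub ps_glyphshow {
            my ($glyph_name) = @_;            # e.g. 'summation' from the glyph list
            return "/$glyph_name glyphshow\n";
        }
        print ps_glyphshow('summation');      # prints: /summation glyphshow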

    --
    $,="\n";foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc")) {$a++;$_=unpack('B8',$_);tr,01,\40#,;$b[$a%6].=$_};print@b,"\n"

Re: HTML to PDF with Unicode
by mattr (Curate) on Nov 08, 2002 at 15:06 UTC
    Not a direct answer, but xpdf can display non-Roman character sets like Japanese and Chinese, depending on how it was compiled. As you probably know...