Interesting. What is certain is that HTML::FormatPS recognizes the Unicode and balks at squeezing a character set of potentially 65,536+ code points into a suite of single-byte character sets, each with somewhat fewer than 256 characters.

While I haven't worked on this exact problem, I have been considering how to support Unicode in an HTML-based lexer. I would fall back on a series of heuristics for preprocessing the HTML into something that either HTML::FormatPS or html2ps will deal with. The following is relevant only if no dedicated modules show up: wherever a relevant Perl module exists, use it in preference to writing your own implementation.

First, become familiar with the Unicode charts at http://www.unicode.org/charts/. You need to work out whether the Unicode in your documents can be reduced to single-byte fonts at all and, if it can, how to do that reduction.
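One way to answer the "is it possible" question is to inventory which Unicode blocks your documents actually touch. The following is only a rough sketch: `block_inventory` is a name I made up for illustration, and it assumes the text has already been decoded into Perl characters. It relies on `charblock` from Unicode::UCD.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock);

# Count how many distinct characters of an already-decoded string
# fall into each Unicode block. (Hypothetical helper, not from any module.)
sub block_inventory {
    my ($text) = @_;
    my (%seen, %per_block);
    for my $char (split //, $text) {
        next if $seen{$char}++;                      # count each character once
        my $block = charblock(ord $char) // 'unknown';
        $per_block{$block}++;
    }
    return \%per_block;
}

# Example: Latin text with two Greek letters and one Hebrew letter.
my $inventory = block_inventory("x + \x{3B1}\x{3B2} \x{5D0}");
printf "%-25s %d\n", $_, $inventory->{$_} for sort keys %$inventory;
```

If every block that shows up has well under 256 distinct characters in use, a reduction to a handful of single-byte fonts is at least plausible.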

Filter #0: deploy and use the Unicode::Normalize, Unicode::UCD, and Encode::Unicode modules when manipulating the Unicode in the HTML. There's no sense in reinventing the wheel, and it will simplify things later to know you're using a given standard form. I would be looking at the normalization forms that use canonical composition (NFC and NFKC). In particular: if the input decodes cleanly with Encode (UTF-8 is handled by Encode itself; Encode::Unicode covers the UTF-16 and UTF-32 variants), that part is done. After that, I would work with whichever of the NFC or NFKC forms (Unicode::Normalize) proved more relevant.
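A minimal sketch of Filter #0, assuming you know (or can guess) the encoding of the incoming octets. `normalize_html_text` is a hypothetical helper name; `decode`, `NFC`, and `NFKC` are the real exports of Encode and Unicode::Normalize.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC NFKC);

# Decode raw octets and put the result into one normalization form.
# $form is 'NFC' (default) or 'NFKC'. Dies if the octets are malformed.
sub normalize_html_text {
    my ($octets, $encoding, $form) = @_;
    my $text = decode($encoding, $octets, Encode::FB_CROAK);
    return ($form // 'NFC') eq 'NFKC' ? NFKC($text) : NFC($text);
}

# "e" followed by COMBINING ACUTE ACCENT, as UTF-8 octets.
my $composed = normalize_html_text("e\xCC\x81", 'UTF-8', 'NFC');
printf "NFC length: %d\n", length $composed;

# The "fi" ligature (U+FB01) as UTF-8 octets; NFKC folds compatibility characters.
my $folded = normalize_html_text("\xEF\xAC\x81", 'UTF-8', 'NFKC');
printf "NFKC result: %s\n", $folded;
```

NFKC is the more aggressive choice: it replaces compatibility characters (ligatures, full-width forms and the like) with plain equivalents, which helps when the goal is to fit into small fonts but loses distinctions that NFC keeps.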

Filter #1: You need to know which additional 1-byte fonts contain the Unicode characters you want. If I were trying to render mathematical literature on a Windows system, I would recommend the Symbol font, a font specifically for Greek, a font specifically for Hebrew, and a font specifically for Russian. This covers just about everything an undergraduate or Master's math student would use, except the special notation for the set of integers and the 'not-congruent' symbol from number theory. You will have to map these Unicode characters to their corresponding single-byte representations wrapped in <font> tags. (These should be more digestible to both HTML::FormatPS and html2ps.)
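To make Filter #1 concrete, here is a sketch of the mapping step for the Symbol font only. The table is a tiny illustrative subset that I am assuming follows the usual Symbol layout (alpha on 'a', beta on 'b', pi on 'p', and so on); check it against the font you actually install. `unicode_to_font_tags` is a hypothetical name, and I have not tested whether HTML::FormatPS or html2ps honors the resulting tags.

```perl
use strict;
use warnings;

# Illustrative subset only: Unicode character => byte in the Symbol font.
my %SYMBOL_FONT = (
    "\x{3B1}" => 'a',   # GREEK SMALL LETTER ALPHA
    "\x{3B2}" => 'b',   # GREEK SMALL LETTER BETA
    "\x{3B3}" => 'g',   # GREEK SMALL LETTER GAMMA
    "\x{3B4}" => 'd',   # GREEK SMALL LETTER DELTA
    "\x{3C0}" => 'p',   # GREEK SMALL LETTER PI
    "\x{3A3}" => 'S',   # GREEK CAPITAL LETTER SIGMA
);

# Replace each run of mapped characters with one <font> element.
# Characters that are not in the table are left untouched.
sub unicode_to_font_tags {
    my ($text) = @_;
    my $mapped = join '|', map { quotemeta } keys %SYMBOL_FONT;
    $text =~ s{((?:$mapped)+)}{
        my $run = join '', map { $SYMBOL_FONT{$_} } split //, $1;
        qq{<font face="Symbol">$run</font>};
    }ge;
    return $text;
}

print unicode_to_font_tags("angle \x{3B1}\x{3B2} = \x{3C0}/2"), "\n";
```

The same pattern extends to one table per font (Greek, Hebrew, Russian), and anything that falls through every table is your list of characters still needing a home.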

I regret not being able to recommend a quick solution. Remember, you only need something that handles your documents.

In reply to Re: HTML to PDF with Unicode by zaimoni
in thread HTML to PDF with Unicode by justinNEE
