Completely off-topic, your post demonstrates the profound stupidity of Unicode ligatures. Ligatures are a typographic trick to make certain sequences of letters like "fi" and "ffi" look pretty when displayed in some media. Comically, the Unicode ligatures not only make life a royal pain for regular expression matching, but they're also ugly as sin (compare the actual "fi" to the "ﬁ"-ligature here). They're even less useful than pages of emoji.

The reason Unicode has those particular ligatures is to preserve the originals when doing round‐trip conversions with legacy encodings that allowed such things to be specified with distinct, individual codes. In modern typesetting, such matters should be — and are — taken care of automatically.

¡Fontalicious!

On the matter of being ugly as sin, here is my emoji example where I actually use fi ligatures three times, just because that was a posting where I was being extreme in the font games. If you look closely at that example, they do look marginally better there than the unkerned alternatives, although not so much that you would normally even notice them. Which is just as it should be.

It certainly isn’t “ugly as sin”; it looks fine. Of course, if you’re using some brutish sans serif font as your default display font and that font hasn’t made allowances for these legacy ligatures, so that you have to resort to some fallback font‐substitution glyph, then, well, that’s the price you pay for brutishness. 😜

On the other hand, in this sample in Adobe Caslon Pro, I use no ligatures at all; all that is figured out for me by the font itself. For a somewhat subtler effect, here’s that sample again, this time in Adobe Garamond Pro. But for real sophistication, there’s just nothing like that same sample rendered in Zapfino.

All three of those samples are fine examples of good kerning rules that don’t make the user say how and what and where things are tied together; that is, ligated. (Hey, did you know that ligar con alguien is Spanish slang for “to hook up”, as in “to get laid”?) It all magically falls out of the OpenType rules built into each respective font.

`NFKD($s) =~ /⋯/i`

Now, regarding the regex matter. The legacy ligatures are actually doing people a service here, because they make it obvious that you cannot just do blind searches on unnormalized Unicode text. Regexes make no allowances for things like default ignorables, diacritic‐insensitive comparisons, decompositions, or collation‐strength equivalences. And you need all those things.
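To make the point about blind searches concrete, here is a minimal sketch of two of those failure modes: a decomposed accent and a default ignorable (SOFT HYPHEN). The `matches` helper and the sample strings are my own illustrations, not anything from the thread, and canonical normalization is only one part of the cure.

```perl
use v5.16;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: report a regex match as 1 or 0.
sub matches {
    my ($text, $re) = @_;
    return $text =~ $re ? 1 : 0;
}

my $composed   = "caf\x{E9}";        # e-acute as one precomposed code point
my $decomposed = "cafe\x{301}";      # e followed by COMBINING ACUTE ACCENT
my $hyphenated = "co\x{AD}operate";  # contains SOFT HYPHEN, a default ignorable

say matches($composed,        qr/^caf\x{E9}$/);  # 1
say matches($decomposed,      qr/^caf\x{E9}$/);  # 0: same text, different code points
say matches(NFC($decomposed), qr/^caf\x{E9}$/);  # 1: normalizing first repairs it
say matches($hyphenated,      qr/cooperate/);    # 0: the invisible character blocks the match
```

Normalization handles the accent case here, but nothing in it removes the soft hyphen; that has to be dealt with separately.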

Now, it just so happens that Unicode does have case folds for the legacy ligatures, although these are the one‐to‐many full case folds that next to nobody but Perl even tries to handle. That means this works:

% perl -E 'say "E\x{FB03}ciency" =~ /^effi/i || 0'
1
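If you want to see the one‐to‐many fold directly rather than through a match, Perl's `fc` function (available from 5.16) exposes it. This is only a small sketch; the variable name is mine.

```perl
use v5.16;    # turns on say and fc()

my $lig = "\x{FB03}";    # LATIN SMALL LIGATURE FFI

say length($lig);        # 1: a single code point
say length(fc($lig));    # 3: the full case fold expands it
say fc($lig) eq 'ffi' ? 'folds to ffi' : 'does not fold to ffi';
```

That three‐character fold is what the case‐insensitive match above is comparing against.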

However, because we don’t allow incomplete matches stranding part of a code point, this doesn’t:


% perl -E 'say "E\x{FB03}ciency"'
Eﬃciency
% perl -E 'say "E\x{FB03}ciency" =~ /^eff/i || 0'
0

That shows why you really want a compatibility decomposition for text searching:


% perl -MUnicode::Normalize -E 'say NFKD("E\x{FB03}ciency") =~ /^effi/i || 0'
1

% perl -E 'say "3:15 \x{33D8}"'
3:15 ㏘
% perl -MUnicode::Normalize -E 'say NFKD("3:15 \x{33D8}") =~ /\bP\.?M\b/i || 0'
1
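In a script rather than a one‐liner, the same idea can be wrapped up so that every search goes through the decomposition first. The following is a sketch only; `nfkd_match` is a hypothetical name of my own, not something Unicode::Normalize provides.

```perl
use v5.16;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: match a pattern against the
# compatibility-decomposed form of the text, returning 1 or 0.
sub nfkd_match {
    my ($text, $pattern) = @_;
    return NFKD($text) =~ $pattern ? 1 : 0;
}

say nfkd_match("E\x{FB03}ciency", qr/^eff/i);       # 1: the ligature has been split apart
say nfkd_match("3:15 \x{33D8}",   qr/\bP\.?M\b/i);  # 1: SQUARE PM has decomposed to letters and dots
say nfkd_match("E\x{FB03}ciency", qr/^efi/i);       # 0: still a real mismatch
```

Note that `/^eff/i`, which failed against the raw string, succeeds here because there is no longer a single code point to strand.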

I’ll address collation‐strength equivalence, including but not limited to diacritic‐insensitive matching, some other day.

In reply to Re: Silly ligatures by tchrist
in thread Unearthed Arcana by tchrist
