Completely off-topic, your post demonstrates the profound stupidity of Unicode ligatures. Ligatures are a typographic trick to make certain sequences of letters like "fi" and "ffi" look pretty when displayed in some media. Comically, the Unicode ligatures not only make life a royal pain for regular expression matching, but they're also ugly as sin (compare the actual "fi" to the "fi"-ligature here). They're even less useful than pages of emoji.
The reason Unicode has those particular ligatures is to preserve the originals when doing round‐trip conversions with legacy encodings that allowed such things to be specified with distinct, individual codes. In modern typesetting, such matters should be, and are, taken care of automatically.
It certainly isn’t “ugly as sin”; it looks fine. Of course, if you’re using some brutish sans serif font as your default display and that font hasn’t made allowances for these legacy ligatures, so that you have to resort to some fallback font‐substitution glyph, then well that’s the price you pay for brutishness. 😜
On the other hand, in this sample in Adobe Caslon Pro, I use no ligatures at all; all that is figured out for me by the font itself. For a somewhat subtler effect, here’s that sample again, this time in Adobe Garamond Pro. But for real sophistication, there’s just nothing like that same sample rendered in Zapfino.
All three of those samples are fine examples of good kerning rules that don’t make the user say how and what and where things are tied together, that is, ligated. (Hey, did you know that ligar con alguien is Spanish slang for “to hook up”, as in “to get laid”?) It all magically falls out of the OpenType rules built into each respective font.
Now, it just so happens that Unicode does have case folds for the legacy ligatures, although these are the one‐to‐many full case folds that next to nobody but Perl even tries to handle. That means this works:
% perl -E 'say "E\x{FB03}ciency" =~ /^effi/i || 0'
1
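If you want to see the one‐to‐many fold itself rather than just its effect on a match, here is a minimal sketch. It assumes Perl 5.16 or later, where the fc function is available; the variable names are mine and only for illustration.

```perl
use v5.16;    # enables say and the fc (full case fold) function

my $lig    = "\x{FB03}";    # LATIN SMALL LIGATURE FFI, a single code point
my $folded = fc($lig);      # full case fold expands it to three code points

say length($lig);           # 1
say length($folded);        # 3
say $folded eq "ffi" ? "folds to ffi" : "does not fold to ffi";

# Folding both sides gives a caseless comparison that sees through the ligature.
say fc("E\x{FB03}ciency") eq fc("EFFICIENCY") ? 1 : 0;    # 1
```

The /i match above is doing the equivalent of this fold internally, which is why the one code point can stand in for three pattern characters.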
However, because we don’t allow incomplete matches stranding part of a code point, this doesn’t:
% perl -E 'say "E\x{FB03}ciency"'
Efficiency
% perl -E 'say "E\x{FB03}ciency" =~ /^eff/i || 0'
0
That shows why you really want a compatibility decomposition for text searching:
% perl -MUnicode::Normalize -E 'say NFKD("E\x{FB03}ciency") =~ /^effi/i || 0'
1
% perl -E 'say "3:15 \x{33D8}"'
3:15 ㏘
% perl -MUnicode::Normalize -E 'say NFKD("3:15 \x{33D8}") =~ /\bP\.?M\b/i || 0'
1
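Wrapped up as a reusable routine, the same idea might look like the sketch below. The name compat_match is hypothetical, something I made up for illustration and not part of Unicode::Normalize; the only library call is NFKD, which that module does export.

```perl
use v5.16;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: run the text through compatibility decomposition
# first, so that ligatures and squared abbreviations become plain letters,
# then apply the caller's pattern. Returns 1 on a match, 0 otherwise.
sub compat_match {
    my ($text, $pattern) = @_;
    return NFKD($text) =~ $pattern ? 1 : 0;
}

say compat_match("E\x{FB03}ciency", qr/^eff/i);       # 1, no longer stranded
say compat_match("E\x{FB03}ciency", qr/^effi/i);      # 1
say compat_match("3:15 \x{33D8}", qr/\bP\.?M\b/i);    # 1
```

Bear in mind that NFKD also splits accented letters into a base letter plus combining marks, so any literal text in the pattern with diacritics would need the same treatment before it could match.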
I’ll address collation‐strength equivalence, including but not limited to diacritic‐insensitive matching, some other day.
In reply to Re: Silly ligatures by tchrist, in thread Unearthed Arcana by tchrist