in reply to unicode normalization

The Unicode::Normalize might be kind of a sledge hammer for the task that you're talking about. "Normalization" covers the equivalences between single-codepoint "complex" characters, e.g. é (U00C9) and concatentations of "component" characters, e.g. e + ́ (U0045 + U0301) -- this also known as character (de)composition. These issues are especially knotty in some languages, and this is what the Unicode::Normalize module is all about. (Note that the result of Normalization often still consists of wide characters.)

You're task seems more like "replace a wide character whenever there is an obvious ascii substitute", which is much simpler; this could apply to various quotes, brackets and other punctuation as well as spaces and hyphens/dashes. (The use of wide-character "smart quotes" seems to be on the rise).

If you have wide-character spaces in a utf8 string that was decoded from HTML, turning them all into ascii spaces is easy:

s/\s/ /g;
(Of course, that will apply to newlines and tabs as well, but with html data, this isn't likely to be a problem.)

As for the various punctuation marks, if you already know which wide characters to expect, just put those into a regex character class:

my $dashes = join '', map { chr() } ( 0xAD, 0x2010 .. 0x2015, 0xFE63, +0xFF0D ); my $squots = join '', map { chr() } ( 0x02BC, 0x2018 .. 0x201B ); my $dquots = join '', map { chr() } ( 0x02EE, 0x201C .. 0x201F ); s/[$dashes]/-/g; s/[$squots]/'/g; s/[$dquots]/"/g;
If you run across any wide characters besides those, you can look them up pretty easily and add to your character classes as needed. Here's a simple script for getting the names of various codepoints, codepoints that match various names, etc.

(update: fixed a typo in the assignment to "$dashes")

Replies are listed 'Best First'.
Re: unicode normalization
by mscudder (Initiate) on May 08, 2006 at 00:34 UTC
    Many thanks to wfsp and graff for your very helpful suggestions, code, and referral to The Björk Situation.

    My solution below, designed for thoroughness rather than speed. Guaranteed at least 150% effective!

    Regards,
    Michael

    use HTML::Entities; use HTML::Tagset; my @_html_entities=( # --------------------------------------------------- # 0 1 2 3 4 5 # char equiv entity entity codepoint description # --------------------------------------------------- ['"', '', 'quot', 34, 'U+0022', 'quotation mark=APL quote', ], ['&', '', 'amp', 38, 'U+0026', 'ampersand', ], ['<', '', 'lt', 60, 'U+003C', 'less-than sign',], ['>', '', 'gt', 62, 'U+003E', 'greater-than sign',], ['', '...','', 133, 'U+0085', '',], ['', '-', '', 150, '', '',], ['', '-', '', 151, '', '',], ['¡', '', 'iexcl', 161, 'U+00A1', 'inverted exclamation mark',], ['¢', '', 'cent', 162, 'U+00A2', 'cent sign', ], ['£', '', 'pound', 163, 'U+00A3', 'pound sign', ], ['¤', '', 'curren', 164, 'U+00A4', 'currency sign', ], ['¥', '', 'yen', 165, 'U+00A5', 'yen sign = yuan sign', ], ['¦', '', 'brvbar', 166, 'U+00A6', 'broken vertical bar',], ['§', '', 'sect', 167, 'U+00A7', 'section sign', ], ['¨', '', 'uml', 168, 'U+00A8', 'diaeresis', ], ['©', '', 'copy', 169, 'U+00A9', 'copyright sign', ], ['ª', '', 'ordf', 170, 'U+00AA', 'feminine ordinal indicator',], # etc. (complete table below) ); my %_entity2char=(); # HTML entity character equivalents my %_char2equiv=(); # my preferred 'ASCII-compatible' character equiv +alents my $_dashes=join '', map { chr() } ( 0x096, 0x097, 0x058A, 0x1806, 0x2010..0x2015, 0x2053, 0x207B, 0x208B, 0x2212, 0xFE63, 0xFF0D); my $_squots=join '', map { chr() } ( 0x02BC, 0x2018..0x201A, 0x2032 ); my $_dquots=join '', map { chr() } ( 0x02EE, 0x201C..0x201E ); my $_spaces=join '', map { chr() } ( 0x2000..0x200B, 0x202F, 0x205F, 0x3000); my $_dots =join '', map { chr() } ( 0x2022, 0x22C5); sub scrub { my $text = shift; return "" if !$text; # remove HTML phrasal level tags foreach my $markup (keys %HTML::Tagset::isPhraseMarkup) { $text=~s/<\s?\/?$markup\s?>/ /gi; } #decode html entities for (1..3) { # assume no more than triple nested html entities HTML::Entities::decode_entities($text) if ($text=~/&#?[a-zA-Z0-9]+;/); } # replace character escapes $text=~s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg; # replace 'wide character' whitespace with # ascii-compatible whitespace $text=~s/\s/ /g; $text=~s/[$_spaces]/ /g; # transliterate 'wide character' punctuation to # ascii-compatible equivalents # with thanks to graff of Perl Monks for this code $text=~s/[$_dashes]/-/g; $text=~s/[$_squots]/\'/g; $text=~s/[$_dquots]/"/g; $text=~s/[$_dots]/\x{00B7}/g; # replace remaining 'wide' characters with # (my preferred) ascii-compatible equivalents $text=~s/(.)/$_char2equiv{$1}?$_char2equiv{$1}:$1/eg; # unidecode any remaining characters greater than 0xff if ($text=~/[\x{100}-\x{ffff}]/) { my @chars=split //, $text; # "Text::Unidecode is meant to be a # transliterator-of-last resort,..." foreach my $char (@chars) { $char=unidecode($char) if $char=~/[\x{0100}-\x{ffff}]/; } $text=join '', @chars; # strip out remaining 'wide' characters $text=~s/[\x{0100}-\x{ffff}]//g; } # trim leading, trailing, and excess whitespace $text=~s/^\s+//; $text=~s/\s{2,}/ /g; $text=~s/\s+$//; return $text; } ####################################### # initialization # ####################################### BEGIN { foreach my $entity (@_html_entities) { $entity->[0]=chr($entity->[3]); $_entity2char{$entity->[2]}=$entity->[0]; $_entity2char{$entity->[3]}=$entity->[0]; $_char2equiv{$entity->[0]}=$entity->[1] if $entity->[1]; } } my @_html_entities=( # ------------------------------------------------------------------ +------------------------------ # 0 1 2 3 4 5 # char equiv entity entity codepoint description # ------------------------------------------------------------------ +------------------------------ ['"', '', 'quot', 34, 'U+0022', 'quotation mark = A +PL quote', ], ['&', '', 'amp', 38, 'U+0026', 'ampersand', + ], ['<', '', 'lt', 60, 'U+003C', 'less-than sign', + ], ['>', '', 'gt', 62, 'U+003E', 'greater-than sign' +, ], ['', '...', '', 133, 'U+0085', '', + ], ['', '-', '', 150, '', '', + ], ['', '-', '', 151, '', '', + ], ['¡', '', 'iexcl', 161, 'U+00A1', 'inverted exclamati +on mark', ], ['¢', '', 'cent', 162, 'U+00A2', 'cent sign', + ], ['£', '', 'pound', 163, 'U+00A3', 'pound sign', + ], ['¤', '', 'curren', 164, 'U+00A4', 'currency sign', + ], ['¥', '', 'yen', 165, 'U+00A5', 'yen sign = yuan si +gn', ], ['¦', '', 'brvbar', 166, 'U+00A6', 'broken bar = broke +n vertical bar', ], ['§', '', 'sect', 167, 'U+00A7', 'section sign', + ], ['¨', '', 'uml', 168, 'U+00A8', 'diaeresis = spacin +g diaeresis', ], ['©', '', 'copy', 169, 'U+00A9', 'copyright sign', + ], ['ª', '', 'ordf', 170, 'U+00AA', 'feminine ordinal i +ndicator', ], ['«', '', 'laquo', 171, 'U+00AB', 'left-pointing doub +le angle quotation mark= left pointing guillemet', ], ['¬', '', 'not', 172, 'U+00AC', 'not sign', + ], ['­', '', 'shy', 173, 'U+00AD', 'soft hyphen = disc +retionary hyphen', ], ['®', '', 'reg', 174, 'U+00AE', 'registered sign = +registered trade mark sign', ], ['¯', '', 'macr', 175, 'U+00AF', 'macron = spacing m +acron = overline= APL overbar', ], ['°', '', 'deg', 176, 'U+00B0', 'degree sign', + ], ['±', '', 'plusmn', 177, 'U+00B1', 'plus-minus sign = +plus-or-minus sign', ], ['²', '', 'sup2', 178, 'U+00B2', 'superscript two = +superscript digit two= squared', ], ['³', '', 'sup3', 179, 'U+00B3', 'superscript three += superscript digit three= cubed', ], ['´', '', 'acute', 180, 'U+00B4', 'acute accent = spa +cing acute', ], ['µ', '', 'micro', 181, 'U+00B5', 'micro sign', + ], ['¶', '', 'para', 182, 'U+00B6', 'pilcrow sign = par +agraph sign', ], ['·', '', 'middot', 183, 'U+00B7', 'middle dot = Georg +ian comma= Greek middle dot', ], ['¸', '', 'cedil', 184, 'U+00B8', 'cedilla = spacing +cedilla', ], ['¹', '', 'sup1', 185, 'U+00B9', 'superscript one = +superscript digit one', ], ['º', '', 'ordm', 186, 'U+00BA', 'masculine ordinal +indicator', ], ['»', '', 'raquo', 187, 'U+00BB', 'right-pointing dou +ble angle quotation mark= right pointing guillemet',], ['¼', '', 'frac14', 188, 'U+00BC', 'vulgar fraction on +e quarter= fraction one quarter', ], ['½', '', 'frac12', 189, 'U+00BD', 'vulgar fraction on +e half= fraction one half', ], ['¾', '', 'frac34', 190, 'U+00BE', 'vulgar fraction th +ree quarters= fraction three quarters', ], ['¿', '', 'iquest', 191, 'U+00BF', 'inverted question +mark= turned question mark', ], # ------------------------------------------------------------------ +------------------------------ # 0 1 2 3 4 5 # char equiv entity entity codepoint description # ------------------------------------------------------------------ +------------------------------ ['À', '', 'Agrave', 192, 'U+00C0', 'latin capital lett +er A with grave= latin capital letter A grave', ], ['Á', '', 'Aacute', 193, 'U+00C1', 'latin capital lett +er A with acute', ], ['Â', '', 'Acirc', 194, 'U+00C2', 'latin capital lett +er A with circumflex', ], ['Ã', '', 'Atilde', 195, 'U+00C3', 'latin capital lett +er A with tilde', ], ['Ä', '', 'Auml', 196, 'U+00C4', 'latin capital lett +er A with diaeresis', ], ['Å', '', 'Aring', 197, 'U+00C5', 'latin capital lett +er A with ring above= latin capital letter A ring', ], ['Æ', '', 'AElig', 198, 'U+00C6', 'latin capital lett +er AE= latin capital ligature AE', ], ['Ç', '', 'Ccedil', 199, 'U+00C7', 'latin capital lett +er C with cedilla', ], ['È', '', 'Egrave', 200, 'U+00C8', 'latin capital lett +er E with grave', ], ['É', '', 'Eacute', 201, 'U+00C9', 'latin capital lett +er E with acute', ], ['Ê', '', 'Ecirc', 202, 'U+00CA', 'latin capital lett +er E with circumflex', ], ['Ë', '', 'Euml', 203, 'U+00CB', 'latin capital lett +er E with diaeresis', ], ['Ì', '', 'Igrave', 204, 'U+00CC', 'latin capital lett +er I with grave', ], ['Í', '', 'Iacute', 205, 'U+00CD', 'latin capital lett +er I with acute', ], ['Î', '', 'Icirc', 206, 'U+00CE', 'latin capital lett +er I with circumflex', ], ['Ï', '', 'Iuml', 207, 'U+00CF', 'latin capital lett +er I with diaeresis', ], ['Ð', '', 'ETH', 208, 'U+00D0', 'latin capital lett +er ETH', ], ['Ñ', '', 'Ntilde', 209, 'U+00D1', 'latin capital lett +er N with tilde', ], ['Ò', '', 'Ograve', 210, 'U+00D2', 'latin capital lett +er O with grave', ], ['Ó', '', 'Oacute', 211, 'U+00D3', 'latin capital lett +er O with acute', ], ['Ô', '', 'Ocirc', 212, 'U+00D4', 'latin capital lett +er O with circumflex', ], ['Õ', '', 'Otilde', 213, 'U+00D5', 'latin capital lett +er O with tilde', ], ['Ö', '', 'Ouml', 214, 'U+00D6', 'latin capital lett +er O with diaeresis', ], ['×', '', 'times', 215, 'U+00D7', 'multiplication sig +n', ], ['Ø', '', 'Oslash', 216, 'U+00D8', 'latin capital lett +er O with stroke= latin capital letter O slash', ], ['Ù', '', 'Ugrave', 217, 'U+00D9', 'latin capital lett +er U with grave', ], ['Ú', '', 'Uacute', 218, 'U+00DA', 'latin capital lett +er U with acute', ], ['Û', '', 'Ucirc', 219, 'U+00DB', 'latin capital lett +er U with circumflex', ], ['Ü', '', 'Uuml', 220, 'U+00DC', 'latin capital lett +er U with diaeresis', ], ['Ý', '', 'Yacute', 221, 'U+00DD', 'latin capital lett +er Y with acute', ], ['Þ', '', 'THORN', 222, 'U+00DE', 'latin capital lett +er THORN', ], ['ß', '', 'szlig', 223, 'U+00DF', 'latin small letter + sharp s = ess-zed', ], ['à', '', 'agrave', 224, 'U+00E0', 'latin small letter + a with grave= latin small letter a grave', ], ['á', '', 'aacute', 225, 'U+00E1', 'latin small letter + a with acute', ], ['â', '', 'acirc', 226, 'U+00E2', 'latin small letter + a with circumflex', ], ['ã', '', 'atilde', 227, 'U+00E3', 'latin small letter + a with tilde', ], ['ä', '', 'auml', 228, 'U+00E4', 'latin small letter + a with diaeresis', ], # ------------------------------------------------------------------ +------------------------------ # 0 1 2 3 4 5 # char equiv entity entity codepoint description # ------------------------------------------------------------------ +------------------------------ ['å', '', 'aring', 229, 'U+00E5', 'latin small letter + a with ring above= latin small letter a ring', ], ['æ', '', 'aelig', 230, 'U+00E6', 'latin small letter + ae= latin small ligature ae', ], ['ç', '', 'ccedil', 231, 'U+00E7', 'latin small letter + c with cedilla', ], ['è', '', 'egrave', 232, 'U+00E8', 'latin small letter + e with grave', ], ['é', '', 'eacute', 233, 'U+00E9', 'latin small letter + e with acute', ], ['ê', '', 'ecirc', 234, 'U+00EA', 'latin small letter + e with circumflex', ], ['ë', '', 'euml', 235, 'U+00EB', 'latin small letter + e with diaeresis', ], ['ì', '', 'igrave', 236, 'U+00EC', 'latin small letter + i with grave', ], ['í', '', 'iacute', 237, 'U+00ED', 'latin small letter + i with acute', ], ['î', '', 'icirc', 238, 'U+00EE', 'latin small letter + i with circumflex', ], ['ï', '', 'iuml', 239, 'U+00EF', 'latin small letter + i with diaeresis', ], ['ð', '', 'eth', 240, 'U+00F0', 'latin small letter + eth', ], ['ñ', '', 'ntilde', 241, 'U+00F1', 'latin small letter + n with tilde', ], ['ò', '', 'ograve', 242, 'U+00F2', 'latin small letter + o with grave', ], ['ó', '', 'oacute', 243, 'U+00F3', 'latin small letter + o with acute', ], ['ô', '', 'ocirc', 244, 'U+00F4', 'latin small letter + o with circumflex', ], ['õ', '', 'otilde', 245, 'U+00F5', 'latin small letter + o with tilde', ], ['ö', '', 'ouml', 246, 'U+00F6', 'latin small letter + o with diaeresis', ], ['÷', '', 'divide', 247, 'U+00F7', 'division sign', + ], ['ù', '', 'ugrave', 249, 'U+00F9', 'latin small letter + u with grave', ], ['ú', '', 'uacute', 250, 'U+00FA', 'latin small letter + u with acute', ], ['û', '', 'ucirc', 251, 'U+00FB', 'latin small letter + u with circumflex', ], ['ü', '', 'uuml', 252, 'U+00FC', 'latin small letter + u with diaeresis', ], ['ý', '', 'yacute', 253, 'U+00FD', 'latin small letter + y with acute', ], ['þ', '', 'thorn', 254, 'U+00FE', 'latin small letter + thorn', ], ['ÿ', '', 'yuml', 255, 'U+00FF', 'latin small letter + y with diaeresis', ], ['Œ', 'OE', 'OElig', 338, 'U+0152', 'latin capital liga +ture OE', ], ['œ', 'oe', 'oelig', 339, 'U+0153', 'latin small ligatu +re oe', ], ['Š', 'S', 'Scaron', 352, 'U+0160', 'latin capital lett +er S with caron', ], ['š', 's', 'scaron', 353, 'U+0161', 'latin small letter + s with caron', ], ['Ÿ', 'Y', 'Yuml', 376, 'U+0178', 'latin capital lett +er Y with diaeresis', ], ['ƒ', 'f', 'fnof', 402, 'U+0192', 'latin small f with + hook = function= florin', ], ['ˆ', '', 'circ', 710, 'U+02C6', 'modifier letter ci +rcumflex accent', ], ['˜', '', 'tilde', 732, 'U+02DC', 'small tilde', + ], ['&#915;', ' Gamma ', 'Gamma', 915, 'U+0393', 'greek capital + letter gamma', ], ['&#916;', ' Delta ', 'Delta', 916, 'U+0394', 'greek capital + letter delta', ], ['&#920;', ' Theta ', 'Theta', 920, 'U+0398', 'greek capital + letter theta', ], # ------------------------------------------------------------------ +------------------------------ # 0 1 2 3 4 5 # char equiv entity entity codepoint description # ------------------------------------------------------------------ +------------------------------ ['&#923;', ' Lambda ', 'Lambda', 923, 'U+039B', 'greek capital + letter lambda', ], ['&#926;', ' Xi ', 'Xi', 926, 'U+039E', 'greek capital + letter xi', ], ['&#928;', ' Pi ', 'Pi', 928, 'U+03A0', 'greek capital + letter pi', ], ['&#931;', ' Sigma ', 'Sigma', 931, 'U+03A3', 'greek capital + letter sigma', ], ['&#933;', ' Upsilon ', 'Upsilon', 933, 'U+03A5', 'greek capital + letter upsilon', ], ['&#934;', ' Phi ', 'Phi', 934, 'U+03A6', 'greek capital + letter phi', ], ['&#936;', ' Psi ', 'Psi', 936, 'U+03A8', 'greek capital + letter psi', ], ['&#937;', ' Omega ', 'Omega', 937, 'U+03A9', 'greek capital + letter omega', ], ['&#945;', ' alpha ', 'alpha', 945, 'U+03B1', 'greek small l +etter alpha', ], ['&#946;', ' beta ', 'beta', 946, 'U+03B2', 'greek small l +etter beta', ], ['&#947;', ' gamma ', 'gamma', 947, 'U+03B3', 'greek small l +etter gamma', ], ['&#948;', ' delta ', 'delta', 948, 'U+03B4', 'greek small l +etter delta', ], ['&#949;', ' epsilon ', 'epsilon', 949, 'U+03B5', 'greek small l +etter epsilon', ], ['&#951;', ' eta ', 'eta', 951, 'U+03B7', 'greek small l +etter eta', ], ['&#952;', ' theta ', 'theta', 952, 'U+03B8', 'greek small l +etter theta', ], ['&#953;', ' iota ', 'iota', 953, 'U+03B9', 'greek small l +etter iota', ], ['&#954;', ' kappa ', 'kappa', 954, 'U+03BA', 'greek small l +etter kappa', ], ['&#955;', ' lambda ', 'lambda', 955, 'U+03BB', 'greek small l +etter lambda', ], ['&#956;', ' mu ', 'mu', 956, 'U+03BC', 'greek small l +etter mu', ], ['&#957;', ' nu ', 'nu', 957, 'U+03BD', 'greek small l +etter nu', ], ['&#958;', ' xi ', 'xi', 958, 'U+03BE', 'greek small l +etter xi', ], ['&#959;', ' omicron ', 'omicron', 959, 'U+03BF', 'greek small l +etter omicron', ], ['&#960;', ' pi ', 'pi', 960, 'U+03C0', 'greek small l +etter pi', ], ['&#961;', ' rho ', 'rho', 961, 'U+03C1', 'greek small l +etter rho', ], ['&#962;', ' sigma ', 'sigmaf', 962, 'U+03C2', 'greek small l +etter final sigma', ], ['&#963;', ' sigma ', 'sigma', 963, 'U+03C3', 'greek small l +etter sigma', ], ['&#964;', ' tau ', 'tau', 964, 'U+03C4', 'greek small l +etter tau', ], ['&#965;', ' upsilon ', 'upsilon', 965, 'U+03C5', 'greek small l +etter upsilon', ], ['&#966;', ' phi ', 'phi', 966, 'U+03C6', 'greek small l +etter phi', ], ['&#967;', ' chi ', 'chi', 967, 'U+03C7', 'greek small l +etter chi', ], ['&#968;', ' psi ', 'psi', 968, 'U+03C8', 'greek small l +etter psi', ], ['&#969;', ' omega ', 'omega', 969, 'U+03C9', 'greek small l +etter omega', ], ['&#977;', ' theta ', 'thetasym', 977, 'U+03D1', 'greek small l +etter theta symbol', ], ['&#978;', ' upsilon ', 'upsih', 978, 'U+03D2', 'greek upsilon + with hook symbol', ], ['&#982;', ' pi ', 'piv', 982, 'U+03D6', 'greek pi symb +ol', ], ['&#8194;', ' ', 'ensp', 8194, 'U+2002', 'en space', + ], ['&#8195;', ' ', 'emsp', 8195, 'U+2003', 'em space', + ], # ------------------------------------------------------------------ +------------------------------ # 0 1 2 3 4 5 # char equiv entity entity codepoint description # ------------------------------------------------------------------ +------------------------------ ['&#8201;', ' ', 'thinsp', 8201, 'U+2009', 'thin space', + ], ['&#8204;', '', 'zwnj', 8204, 'U+200C', 'zero width n +on-joiner', ], ['&#8205;', '', 'zwj', 8205, 'U+200D', 'zero width j +oiner', ], ['&#8206;', '->', 'lrm', 8206, 'U+200E', 'left-to-righ +t mark', ], ['&#8207;', '<-', 'rlm', 8207, 'U+200F', 'right-to-lef +t mark', ], ['–', '-', 'ndash', 8211, 'U+2013', 'en dash', + ], ['—', '-', 'mdash', 8212, 'U+2014', 'em dash', + ], ['‘', '\'', 'lsquo', 8216, 'U+2018', 'left single quotat +ion mark', ], ['’', '\'', 'rsquo', 8217, 'U+2019', 'right single quota +tion mark', ], ['‚', '\'', 'sbquo', 8218, 'U+201A', 'single low-9 quota +tion mark', ], ['“', '\"', 'ldquo', 8220, 'U+201C', 'left double quotat +ion mark', ], ['”', '\"', 'rdquo', 8221, 'U+201D', 'right double quota +tion mark', ], ['„', '\"', 'bdquo', 8222, 'U+201E', 'double low-9 quota +tion mark', ], ['†', '+', 'dagger', 8224, 'U+2020', 'dagger', + ], ['‡', '++', 'Dagger', 8225, 'U+2021', 'double dagger', + ], ['•', chr(183), 'bull', 8226, 'U+2022', 'bullet = black sma +ll circle', ], ['…', '...', 'hellip', 8230, 'U+2026', 'horizontal ellipsi +s = three dot leader', ], ['‰', '%%', 'permil', 8240, 'U+2030', 'per mille sign', + ], ['&#8242;', '\'', 'prime', 8242, 'U+2032', 'prime = minu +tes = feet', ], ['‹', '<', 'lsaquo', 8249, 'U+2039', 'single left-pointi +ng angle quotation mark', ], ['›', '>', 'rsaquo', 8250, 'U+203A', 'single right-point +ing angle quotation mark', ], ['&#8254;', '', 'oline', 8254, 'U+203E', 'overline = s +pacing overscore', ], ['&#8260;', '/', 'frasl', 8260, 'U+2044', 'fraction sla +sh', ], ['€', ' euro ', 'euro', 8364, 'U+20AC', 'euro sign', + ], ['&#8465;', 'I', 'image', 8465, 'U+2111', 'blackletter +capital I = imaginary part', ], ['&#8472;', 'P', 'weierp', 8472, 'U+2118', 'script capit +al P = power set= Weierstrass p', ], ['&#8476;', 'R', 'real', 8476, 'U+211C', 'blackletter +capital R = real part symbol', ], ['™', '(tm)', 'trade', 8482, 'U+2122', 'trade mark sign', + ], ['&#8501;', '', 'alefsym', 8501, 'U+2135', 'alef symbol += first transfinite cardinal', ], ['&#8592;', '<-', 'larr', 8592, 'U+2190', 'leftwards ar +row', ], ['&#8593;', '', 'uarr', 8593, 'U+2191', 'upwards arro +w', ], ['&#8594;', '->', 'rarr', 8594, 'U+2192', 'rightwards a +rrow', ], ['&#8595;', '', 'darr', 8595, 'U+2193', 'downwards ar +row', ], ['&#8596;', '', 'harr', 8596, 'U+2194', 'left right a +rrow', ], ['&#8629;', '<-', 'crarr', 8629, 'U+21B5', 'downwards ar +row with corner leftwards= carriage return', ], ['&#8656;', '<=', 'lArr', 8656, 'U+21D0', 'leftwards do +uble arrow', ], ['&#8657;', '', 'uArr', 8657, 'U+21D1', 'upwards doub +le arrow', ], # ------------------------------------------------------------------ +------------------------------ # 0 1 2 3 4 5 # char equiv entity entity codepoint description # ------------------------------------------------------------------ +------------------------------ ['&#8658;', '=>', 'rArr', 8658, 'U+21D2', 'rightwards d +ouble arrow', ], ['&#8659;', '', 'dArr', 8659, 'U+21D3', 'downwards do +uble arrow', ], ['&#8704;', ' foreach ', 'forall', 8704, 'U+2200', 'for all', + ], ['&#8706;', '', 'part', 8706, 'U+2202', 'partial diff +erential', ], ['&#8707;', '', 'exist', 8707, 'U+2203', 'there exists +', ], ['&#8709;', '', 'empty', 8709, 'U+2205', 'empty set = +null set = diameter', ], ['&#8711;', '', 'nabla', 8711, 'U+2207', 'nabla = back +ward difference', ], ['&#8712;', '', 'isin', 8712, 'U+2208', 'element of', + ], ['&#8713;', '', 'notin', 8713, 'U+2209', 'not an eleme +nt of', ], ['&#8715;', '', 'ni', 8715, 'U+220B', 'contains as +member', ], ['&#8719;', '', 'prod', 8719, 'U+220F', 'n-ary produc +t = product sign', ], ['&#8721;', '', 'sum', 8721, 'U+2211', 'n-ary sumati +on', ], ['&#8722;', '-', 'minus', 8722, 'U+2212', 'minus sign', + ], ['&#8727;', '*', 'lowast', 8727, 'U+2217', 'asterisk ope +rator', ], ['&#8730;', '', 'radic', 8730, 'U+221A', 'square root += radical sign', ], ['&#8733;', '', 'prop', 8733, 'U+221D', 'proportional + to', ], ['&#8734;', '', 'infin', 8734, 'U+221E', 'infinity', + ], ['&#8736;', '', 'ang', 8736, 'U+2220', 'angle', + ], ['&#8743;', ' AND ', 'and', 8743, 'U+2227', 'logical and += wedge', ], ['&#8744;', ' OR ', 'or', 8744, 'U+2228', 'logical or = + vee', ], ['&#8745;', '', 'cap', 8745, 'U+2229', 'intersection + = cap', ], ['&#8746;', '', 'cup', 8746, 'U+222A', 'union = cup' +, ], ['&#8756;', '', 'there4', 8756, 'U+2234', 'therefore', + ], ['&#8764;', '~', 'sim', 8764, 'U+223C', 'tilde operat +or = varies with = similar to', ], ['&#8773;', '~', 'cong', 8773, 'U+2245', 'approximatel +y equal to', ], ['&#8776;', '', 'asymp', 8776, 'U+2248', 'almost equal + to = asymptotic to', ], ['&#8800;', '<>', 'ne', 8800, 'U+2260', 'not equal to +', ], ['&#8801;', '', 'equiv', 8801, 'U+2261', 'identical to +', ], ['&#8804;', '<=', 'le', 8804, 'U+2264', 'less-than or + equal to', ], ['&#8805;', '>=', 'ge', 8805, 'U+2265', 'greater-than + or equal to', ], ['&#8834;', '', 'sub', 8834, 'U+2282', 'subset of', + ], ['&#8835;', '', 'sup', 8835, 'U+2283', 'superset of' +, ], ['&#8836;', '', 'nsub', 8836, 'U+2284', 'not a subset + of', ], ['&#8838;', '', 'sube', 8838, 'U+2286', 'subset of or + equal to', ], ['&#8839;', '', 'supe', 8839, 'U+2287', 'superset of +or equal to', ], ['&#8853;', '', 'oplus', 8853, 'U+2295', 'circled plus + = direct sum', ], ['&#8855;', '', 'otimes', 8855, 'U+2297', 'circled time +s = vector product', ], # ------------------------------------------------------------------ +------------------------------ # 0 1 2 3 4 5 # char equiv entity entity codepoint description # ------------------------------------------------------------------ +------------------------------ ['&#8869;', '', 'perp', 8869, 'U+22A5', 'up tack = or +thogonal to = perpendicular', ], ['&#8901;', chr(177), 'sdot', 8901, 'U+22C5', 'dot operator +', ], ['&#8968;', '', 'lceil', 8968, 'U+2308', 'left ceiling + = apl upstile', ], ['&#8969;', '', 'rceil', 8969, 'U+2309', 'right ceilin +g', ], ['&#8970;', '', 'lfloor', 8970, 'U+230A', 'left floor = + apl downstile', ], ['&#8971;', '', 'rfloor', 8971, 'U+230B', 'right floor' +, ], ['&#9001;', '<', 'lang', 9001, 'U+2329', 'left-pointin +g angle bracket = bra', ], ['&#9674;', '', 'loz', 9674, 'U+25CA', 'lozenge', + ], ['&#9824;', '', 'spades', 9824, 'U+2660', 'black spade +suit', ], ['&#9827;', '', 'clubs', 9827, 'U+2663', 'black club s +uit = shamrock', ], ['&#9829;', '', 'hearts', 9829, 'U+2665', 'black heart +suit = valentine', ], ['&#9830;', '', 'diams', 9830, 'U+2666', 'black diamon +d suit', ], # ------------------------------------------------------------------ +------------------------------ # 0 1 2 3 4 5 # char equiv entity entity codepoint description # ------------------------------------------------------------------ +------------------------------ );