in reply to regex quotes character class
You could use the following:
[\p{QMark}`\N{U+275B}-\N{U+275E}\N{U+1F676}-\N{U+1F678}\N{U+2826}\N{U+2834}]
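Here is a minimal sketch of how that class might be used to count quote-like characters. The variable name `$quote_re` and the sample strings are made up for illustration and are not taken from the original question.

```perl
use 5.014;
use warnings;
use utf8;                                  # the source below contains literal « and »
use open qw( :std :encoding(UTF-8) );      # so the output is encoded properly

# The class from above, compiled once.
my $quote_re = qr/[\p{QMark}`\N{U+275B}-\N{U+275E}\N{U+1F676}-\N{U+1F678}\N{U+2826}\N{U+2834}]/;

# Hypothetical sample inputs.
my @samples = (
   'plain text',
   '«guillemets»',
   "\N{U+275D}ornament\N{U+275E}",
   'back`tick',
);

for my $s (@samples) {
   # Count the matches by forcing list context, then taking the list's size.
   my $count = () = $s =~ /$quote_re/g;
   say "$count quote-like character(s) in: $s";
}
```

The ranges inside the class rely on `\N{U+...}` being usable as a range endpoint, which Perl allows in bracketed character classes.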
What follows explains how this was derived.
for q in \" \' « » ‘ ’ ‛ “ ” ❝ ❞ 🙶 🙷 \`; do
   uniprops --all --single -- "$q" >"props-$q"
done
The programs uniprops and unichars (used later) are provided by Unicode::Tussle.
Let's collect what we have.
perl -e'
   use 5.014;
   use warnings;

   my %props;
   while (<>) {
      chomp;
      ++$props{$_};
   }

   say "$props{$_} $_"
      for sort { $props{$b} <=> $props{$a} || $a cmp $b } keys(%props);
' props-*
The list is long, but a lot are redundant (aliases and short forms).
[It would be nice if we could tell it to output just one form of equivalent forms!]
We consult perluniprops and see that no interesting property matches all 14. This is not surprising, since we have characters from two General Categories.
9 Punctuation
5 Symbol
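The same split can be cross-checked from Perl without the props-* files. This is only a sketch: it hard-codes the code points of the 14 characters from the loop above, uses `charinfo` from the core module Unicode::UCD, and assumes a perl recent enough to know U+1F676 and U+1F677 (Unicode 7.0, so roughly 5.20 or later). The `%major` and `%tally` names are just for illustration.

```perl
use 5.014;
use warnings;
use Unicode::UCD qw( charinfo );

# Code points of the 14 characters fed to uniprops above.
my @cps = (
   0x22, 0x27, 0xAB, 0xBB, 0x2018, 0x2019, 0x201B, 0x201C, 0x201D,
   0x275D, 0x275E, 0x1F676, 0x1F677, 0x60,
);

# Map the first letter of the two-letter General Category to its major class.
my %major = ( P => 'Punctuation', S => 'Symbol' );

my %tally;
for my $cp (@cps) {
   my $gc = charinfo($cp)->{category};            # e.g. "Po", "Pi", "Sk", "So"
   ++$tally{ $major{ substr($gc, 0, 1) } // $gc };
}

say "$tally{$_} $_" for sort keys(%tally);
```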
The 9 punctuation characters all match \p{Quotation_Mark} aka \p{QMark}! This is the full set of quotation marks:
$ unichars -au '\p{QMark}' | cat
" U+00022 QUOTATION MARK
' U+00027 APOSTROPHE
« U+000AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
» U+000BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
‘ U+02018 LEFT SINGLE QUOTATION MARK
’ U+02019 RIGHT SINGLE QUOTATION MARK
‚ U+0201A SINGLE LOW-9 QUOTATION MARK
‛ U+0201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
“ U+0201C LEFT DOUBLE QUOTATION MARK
” U+0201D RIGHT DOUBLE QUOTATION MARK
„ U+0201E DOUBLE LOW-9 QUOTATION MARK
‟ U+0201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK
‹ U+02039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
› U+0203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
⹂ U+02E42 DOUBLE LOW-REVERSED-9 QUOTATION MARK
「 U+0300C LEFT CORNER BRACKET
」 U+0300D RIGHT CORNER BRACKET
『 U+0300E LEFT WHITE CORNER BRACKET
』 U+0300F RIGHT WHITE CORNER BRACKET
〝 U+0301D REVERSED DOUBLE PRIME QUOTATION MARK
〞 U+0301E DOUBLE PRIME QUOTATION MARK
〟 U+0301F LOW DOUBLE PRIME QUOTATION MARK
﹁ U+0FE41 PRESENTATION FORM FOR VERTICAL LEFT CORNER BRACKET
﹂ U+0FE42 PRESENTATION FORM FOR VERTICAL RIGHT CORNER BRACKET
﹃ U+0FE43 PRESENTATION FORM FOR VERTICAL LEFT WHITE CORNER BRACKET
﹄ U+0FE44 PRESENTATION FORM FOR VERTICAL RIGHT WHITE CORNER BRACKET
＂ U+0FF02 FULLWIDTH QUOTATION MARK
＇ U+0FF07 FULLWIDTH APOSTROPHE
｢ U+0FF62 HALFWIDTH LEFT CORNER BRACKET
｣ U+0FF63 HALFWIDTH RIGHT CORNER BRACKET
The 5 symbol characters aren't in any useful category. The symbols are:
$ grep -L ^QMark$ props-* \
   | perl -CS -ne'
      use 5.014;
      use warnings;
      use charnames qw( );
      s/^props-//;
      $_ = ord($_);
      printf "U+%05X %s\n", $_, charnames::viacode($_);
   ' \
   | sort
U+00060 GRAVE ACCENT
U+0275D HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
U+0275E HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
U+1F676 SANS-SERIF HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
U+1F677 SANS-SERIF HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
So you could use
[\p{QMark}`\N{U+275D}\N{U+275E}\N{U+1F676}\N{U+1F677}]
Four of those you listed have "QUOTATION MARK" in their name, so why aren't they matched by \p{QMark}?
Well, they're actually "QUOTATION MARK ORNAMENT". U+275D, U+275E, U+1F676 and U+1F677 are all dingbats (emojis from before emojis was a word, kinda). They're not meant for use in text.
There are three more of these:
U+0275B HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
U+0275C HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT
U+1F678 SANS-SERIF HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT
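If you want to confirm that none of the seven ornaments carries the Quotation_Mark property, something like the following sketch should do. It assumes a perl that knows the U+1F676..U+1F678 code points (otherwise their names come back undefined, which the fallback string covers); `@ornaments` is just an illustrative name.

```perl
use 5.014;
use warnings;
use charnames ();

# The four ornaments from the question plus the three extra ones.
my @ornaments = ( 0x275B .. 0x275E, 0x1F676 .. 0x1F678 );

for my $cp (@ornaments) {
   my $is_qmark = chr($cp) =~ /\p{QMark}/ ? 'yes' : 'no';
   printf "U+%05X QMark=%-3s %s\n",
      $cp, $is_qmark, charnames::viacode($cp) // '(name unknown to this perl)';
}
```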
Finally, this table also points out two braille characters you might want to include.
U+02826 BRAILLE PATTERN DOTS-236
U+02834 BRAILLE PATTERN DOTS-356
Update: Added the first section (the summary/"tl;dr") and the last section about why the 5 aren't quotation marks. Added the suggestion for additions to the list. Small wording tweaks.