Re: regex quotes character class

You could use the following:

[\p{QMark}`\N{U+275B}-\N{U+275E}\N{U+1F676}-\N{U+1F678}\N{U+2826}\N{U+2834}]
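As a quick illustration of how that class might be used, here is a minimal sketch that counts and strips quote-like characters from a string. The variable names and the sample text are made up for the example, and the sample characters are written as \x{...} escapes so the snippet does not depend on the source file's encoding.

```perl
use 5.014;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# The character class from the summary, compiled once.
# ($quote_re is just an illustrative name.)
my $quote_re = qr/[\p{QMark}`\N{U+275B}-\N{U+275E}\N{U+1F676}-\N{U+1F678}\N{U+2826}\N{U+2834}]/;

# Sample text: « » (U+00AB, U+00BB), “ ” (U+201C, U+201D),
# two backticks, and ❝ ❞ (U+275D, U+275E).
my $text = "\x{AB}Bonjour\x{BB}, \x{201C}hello\x{201D}, `code`, \x{275D}ornament\x{275E}, plain";

# Count the quote-like characters, then remove them from a copy.
my $count = () = $text =~ /$quote_re/g;
( my $stripped = $text ) =~ s/$quote_re//g;

say "found $count quote-like characters";
say $stripped;
```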

What follows explains how this was derived.

Let's start by collecting some info about each character.

for q in \" \' « » ‘ ’ ‛ “ ” ❝ ❞ 🙶 🙷 \`; do
   uniprops --all --single -- "$q" >"props-$q"
done

The programs uniprops and unichars (used later) are provided by Unicode::Tussle.

Let's collect what we have.

perl -e'
   use 5.014;
   use warnings;

   my %props; while (<>) { chomp; ++$props{$_}; }

   say "$props{$_} $_"
      for
         sort { $props{$b} <=> $props{$a} || $a cmp $b }
            keys(%props);
' props-*

The list is long, but a lot are redundant (aliases and short forms).

[It would be nice if we could tell uniprops to output just one of each set of equivalent forms!]

14 All
14 Any
14 Assigned
14 BC=ON
14 Bidi_Class=ON
14 Bidi_Class=Other_Neutral
14 Bidi_Paired_Bracket_Type=None
14 CCC=NR
14 Canonical_Combining_Class=0
14 Canonical_Combining_Class=NR
14 Canonical_Combining_Class=Not_Reordered
14 Common
14 DT=None
14 Decomposition_Type=None
14 GCB=XX
14 GrBase
14 Gr_Base
14 Graph
14 Grapheme_Base
14 Grapheme_Cluster_Break=Other
14 Grapheme_Cluster_Break=XX
14 HST=NA
14 Hangul_Syllable_Type=NA
14 Hangul_Syllable_Type=Not_Applicable
14 IN=10.0
14 IN=11.0
14 IN=12.0
14 IN=12.1
14 IN=7.0
14 IN=8.0
14 IN=9.0
14 InPC=NA
14 InSC=Other
14 Indic_Positional_Category=NA
14 Indic_Syllabic_Category=Other
14 JG=NoJoiningGroup
14 JT=U
14 Joining_Group=No_Joining_Group
14 Joining_Type=Non_Joining
14 Joining_Type=U
14 NT=None
14 NV=NaN
14 Numeric_Type=None
14 Numeric_Value=NaN
14 Present_In=10.0
14 Present_In=11.0
14 Present_In=12.0
14 Present_In=12.1
14 Present_In=7.0
14 Present_In=8.0
14 Present_In=9.0
14 Present_In=V10_0
14 Present_In=V11_0
14 Present_In=V12_0
14 Present_In=V12_1
14 Present_In=V7_0
14 Present_In=V8_0
14 Present_In=V9_0
14 Print
14 SC=Zyyy
14 Script=Common
14 Script=Zyyy
14 Script_Extensions=Common
14 Script_Extensions=Zyyy
14 Scx=Zyyy
14 Unicode
14 X_POSIX_Graph
14 X_POSIX_Print
14 Zyyy

13 LB=QU
13 Line_Break=QU
13 Line_Break=Quotation
13 SB=CL
13 Sentence_Break=CL
13 Sentence_Break=Close

12 Age=1.1
12 Age=V1_1
12 IN=1.1
12 IN=2.0
12 IN=2.1
12 IN=3.0
12 IN=3.1
12 IN=3.2
12 IN=4.0
12 IN=4.1
12 IN=5.0
12 IN=5.1
12 IN=5.2
12 IN=6.0
12 IN=6.1
12 IN=6.2
12 IN=6.3
12 PatSyn
12 Pat_Syn
12 Pattern_Syntax
12 Present_In=1.1
12 Present_In=2.0
12 Present_In=2.1
12 Present_In=3.0
12 Present_In=3.1
12 Present_In=3.2
12 Present_In=4.0
12 Present_In=4.1
12 Present_In=5.0
12 Present_In=5.1
12 Present_In=5.2
12 Present_In=6.0
12 Present_In=6.1
12 Present_In=6.2
12 Present_In=6.3
12 Present_In=V2_0
12 Present_In=V2_1
12 Present_In=V3_0
12 Present_In=V3_1
12 Present_In=V3_2
12 Present_In=V4_0
12 Present_In=V4_1
12 Present_In=V5_0
12 Present_In=V5_1
12 Present_In=V5_2
12 Present_In=V6_0
12 Present_In=V6_1
12 Present_In=V6_2
12 Present_In=V6_3

10 Vertical_Orientation=R
10 Vertical_Orientation=Rotated
10 Vo=R
10 WB=XX
10 Word_Break=Other
10 Word_Break=XX
10 X_POSIX_Punct

9 Is_Punctuation
9 P
9 Punct
9 Punctuation
9 QMark
9 Quotation_Mark
9 \pP

7 East_Asian_Width=Neutral

5 BLK=Punctuation
5 Block=General_Punctuation
5 Block=Punctuation
5 General_Punctuation
5 InPunctuation
5 S
5 Symbol
5 \pS

4 CI
4 Case_Ignorable
4 EA=A
4 East_Asian_Width=A
4 East_Asian_Width=Ambiguous
4 Initial_Punctuation
4 Other_Symbol
4 Pi
4 So
4 Vertical_Orientation=U
4 Vertical_Orientation=Upright
4 Vo=U
4 \p{Pi}
4 \p{So}

3 ASCII
3 BLK=ASCII
3 Basic_Latin
3 Block=ASCII
3 Block=Basic_Latin
3 EA=Na
3 East_Asian_Width=Na
3 East_Asian_Width=Narrow
3 Final_Punctuation
3 POSIX_Graph
3 POSIX_Print
3 POSIX_Punct
3 Pf
3 \p{Pf}

2 Age=7.0
2 Age=V7_0
2 BLK=Latin1
2 BidiM
2 Bidi_M
2 Bidi_Mirrored
2 Block=Dingbats
2 Block=Latin_1
2 Block=Latin_1_Sup
2 Block=Latin_1_Supplement
2 Block=Ornamental_Dingbats
2 Dingbats
2 InLatin1
2 Latin_1
2 Latin_1_Sup
2 Latin_1_Supplement
2 Ornamental_Dingbats
2 Other_Punctuation
2 Po
2 WB=MB
2 Word_Break=MB
2 Word_Break=MidNumLet
2 \p{Po}

1 Dia
1 Diacritic
1 LB=AL
1 Line_Break=AL
1 Line_Break=Alphabetic
1 Modifier_Symbol
1 SB=XX
1 Sentence_Break=Other
1 Sentence_Break=XX
1 Sk
1 U+0022 ‹"› \N{QUOTATION MARK}
1 U+0027 ‹'› \N{APOSTROPHE}
1 U+0060 ‹`› \N{GRAVE ACCENT}
1 U+00AB ‹«› \N{LEFT-POINTING DOUBLE ANGLE QUOTATION MARK}
1 U+00BB ‹»› \N{RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK}
1 U+1F676 ‹🙶› \N{SANS-SERIF HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT}
1 U+1F677 ‹🙷› \N{SANS-SERIF HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT}
1 U+2018 ‹‘› \N{LEFT SINGLE QUOTATION MARK}
1 U+2019 ‹’› \N{RIGHT SINGLE QUOTATION MARK}
1 U+201B ‹‛› \N{SINGLE HIGH-REVERSED-9 QUOTATION MARK}
1 U+201C ‹“› \N{LEFT DOUBLE QUOTATION MARK}
1 U+201D ‹”› \N{RIGHT DOUBLE QUOTATION MARK}
1 U+275D ‹❝› \N{HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT}
1 U+275E ‹❞› \N{HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT}
1 WB=DQ
1 WB=SQ
1 Word_Break=DQ
1 Word_Break=Double_Quote
1 Word_Break=SQ
1 Word_Break=Single_Quote
1 \p{Sk}
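The redundancy in that listing (aliases and short forms) could probably be reduced after the fact. The following is only a sketch: it assumes Perl 5.16 or later, where Unicode::UCD provides prop_aliases and prop_value_aliases, and it assumes the second element those functions return is the long name (which holds for most properties, but is not checked here). The helper names canonical_form and dedupe_tally are made up for this example. Anything Unicode::UCD does not recognize, such as the U+XXXX lines, is passed through unchanged.

```perl
use 5.016;
use warnings;
use Unicode::UCD qw( prop_aliases prop_value_aliases );

# Map one property spelling from the uniprops output to a single form.
# Unrecognized input is returned unchanged.
sub canonical_form {
   my ($name) = @_;

   # "Property=Value" forms, e.g. "LB=QU" or "Line_Break=Quotation".
   if ( my ($prop, $value) = $name =~ /^(\w+)=(\S+)\z/ ) {
      my @p = grep { defined } prop_aliases($prop);
      my @v = grep { defined } prop_value_aliases($prop, $value);
      return $name if !@p || !@v;
      return ( $p[1] // $p[0] ) . '=' . ( $v[1] // $v[0] );
   }

   # Single-word forms, e.g. "QMark" or "Quotation_Mark".
   if ( $name =~ /^\w+\z/ ) {
      my @p = grep { defined } prop_aliases($name);
      return $p[1] // $p[0] if @p;
   }

   return $name;
}

# Given chomped "COUNT NAME" lines, keep only the first line seen
# for each (count, canonical form) pair. Other lines are kept as-is.
sub dedupe_tally {
   my %seen;
   return grep {
      my ($count, $name) = /^(\d+) (.*)\z/s;
      defined($name) ? !$seen{ "$count " . canonical_form($name) }++ : 1;
   } @_;
}

say for dedupe_tally( '13 LB=QU', '13 Line_Break=QU', '13 Line_Break=Quotation' );
```

Deduplicating the finished tally, rather than canonicalizing each line before counting, avoids counting one character several times for the same property.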

We consult perluniprops and see that no interesting property matches all 14. This is not surprising, since we have characters from two General Categories.

9 Punctuation
5 Symbol

The 9 punctuation characters all match \p{Quotation_Mark} aka \p{QMark}! This is the full set of quotation marks:

$ unichars -au '\p{QMark}' | cat
‭ "  U+00022 QUOTATION MARK
‭ '  U+00027 APOSTROPHE
‭ «  U+000AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
‭ »  U+000BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
‭ ‘  U+02018 LEFT SINGLE QUOTATION MARK
‭ ’  U+02019 RIGHT SINGLE QUOTATION MARK
‭ ‚  U+0201A SINGLE LOW-9 QUOTATION MARK
‭ ‛  U+0201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
‭ “  U+0201C LEFT DOUBLE QUOTATION MARK
‭ ”  U+0201D RIGHT DOUBLE QUOTATION MARK
‭ „  U+0201E DOUBLE LOW-9 QUOTATION MARK
‭ ‟  U+0201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK
‭ ‹  U+02039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
‭ ›  U+0203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
‭ ⹂  U+02E42 DOUBLE LOW-REVERSED-9 QUOTATION MARK
‭ 「 U+0300C LEFT CORNER BRACKET
‭ 」 U+0300D RIGHT CORNER BRACKET
‭ 『 U+0300E LEFT WHITE CORNER BRACKET
‭ 』 U+0300F RIGHT WHITE CORNER BRACKET
‭ 〝 U+0301D REVERSED DOUBLE PRIME QUOTATION MARK
‭ 〞 U+0301E DOUBLE PRIME QUOTATION MARK
‭ 〟 U+0301F LOW DOUBLE PRIME QUOTATION MARK
‭ ﹁ U+0FE41 PRESENTATION FORM FOR VERTICAL LEFT CORNER BRACKET
‭ ﹂ U+0FE42 PRESENTATION FORM FOR VERTICAL RIGHT CORNER BRACKET
‭ ﹃ U+0FE43 PRESENTATION FORM FOR VERTICAL LEFT WHITE CORNER BRACKET
‭ ﹄ U+0FE44 PRESENTATION FORM FOR VERTICAL RIGHT WHITE CORNER BRACKET
‭ ＂ U+0FF02 FULLWIDTH QUOTATION MARK
‭ ＇ U+0FF07 FULLWIDTH APOSTROPHE
‭ ｢  U+0FF62 HALFWIDTH LEFT CORNER BRACKET
‭ ｣  U+0FF63 HALFWIDTH RIGHT CORNER BRACKET

The 5 symbol characters aren't in any useful category. The symbols are:

$ grep -L ^QMark$ props-* \
| perl -CS -ne'
   use 5.014;
   use warnings;
   use charnames qw( );

   s/^props-//;
   $_ = ord($_);
   printf "U+%05X %s\n", $_, charnames::viacode($_);
' \
| sort
U+00060 GRAVE ACCENT
U+0275D HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
U+0275E HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
U+1F676 SANS-SERIF HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
U+1F677 SANS-SERIF HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
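The same 9/5 split can also be checked without the props-* files. This is a minimal sketch that tests each of the 14 characters against \p{QMark} directly; the code points are the ones listed above, and the variable names are only illustrative. Printing the two ornament names from the Ornamental Dingbats block assumes a perl whose Unicode tables include version 7.0 (older perls will not know those names).

```perl
use 5.014;
use warnings;
use charnames qw( );

# The 14 characters from the question, as code points.
my @cps = (
   0x22, 0x27, 0xAB, 0xBB, 0x2018, 0x2019, 0x201B, 0x201C, 0x201D,
   0x275D, 0x275E, 0x1F676, 0x1F677, 0x60,
);

# Partition them by whether they have the Quotation_Mark property.
my (@qmark, @other);
for my $cp (@cps) {
   if ( chr($cp) =~ /\p{QMark}/ ) { push @qmark, $cp }
   else                           { push @other, $cp }
}

printf "%d match \\p{QMark}, %d do not:\n", scalar(@qmark), scalar(@other);
printf "U+%05X %s\n", $_, charnames::viacode($_) // '(name unknown to this perl)'
   for sort { $a <=> $b } @other;
```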

So you could use

[\p{QMark}`\N{U+275D}\N{U+275E}\N{U+1F676}\N{U+1F677}]

Four of those you listed have "QUOTATION MARK" in their name, so why aren't they matched by \p{QMark}?

Well, they're actually "QUOTATION MARK ORNAMENT". U+275D, U+275E, U+1F676 and U+1F677 are all dingbats (emojis from before emojis was a word, kinda). They're not meant for use in text.

There are three more of these:

U+0275B HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
U+0275C HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT
U+1F678 SANS-SERIF HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT

Finally, this table also points out two braille characters you might want to include.

U+02826 BRAILLE PATTERN DOTS-236
U+02834 BRAILLE PATTERN DOTS-356

Update: Added the first section (the summary/"tl;dr") and the last section about why the 5 aren't quotation marks. Added the suggestion for additions to the list. Small wording tweaks.
