in reply to regex quotes character class

G'day igoryonya,

"I think, I used to stumble upon a character class, or a property ... can't find any information about it now ... something like ... [[:quote:]] or \p{Quote} ..."

I suspect you were looking in one or both of perlrecharclass or perlrebackslash; however, neither [[:quote:]] nor \p{Quote} exists (as far as I can tell).

I see ++soonix has indicated \p{QMark} and \p{Quotation_Mark} in perluniprops.
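For a quick feel of how either spelling behaves in a regex, here's a minimal sketch. The sample sentence and the variable names are mine, made up for illustration; only the two property names come from perluniprops.

```perl
use strict;
use warnings;
use utf8;                                    # the source contains literal Unicode quotes
use open OUT => qw{:encoding(UTF-8) :std};

# Hypothetical sample text mixing ASCII, curly, and angle quotes.
my $text = "She said “hello” and 'bye' to «them».";

# Capture every character that has the Quotation_Mark property.
my @quotes = $text =~ /(\p{Quotation_Mark})/g;
printf "U+%04X  %s\n", ord($_), $_ for @quotes;

# The short alias works the same way; here it strips the quotes.
(my $stripped = $text) =~ s/\p{QMark}//g;
print "$stripped\n";
```

Note that the apostrophe (U+0027) has this property too, so a blanket substitution like the last one will also eat apostrophes inside words such as "don't".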

On a number of occasions in the past, I have had to identify Unicode properties (which could be for spaces, combining characters, specific scripts, and so on). I've tried the table in the "perluniprops: Properties accessible through \p{} and \P{}" section but find it to be a hard slog; for instance, a case-insensitive search for "quote" does find some matches but not the \p{Quotation_Mark} or \p{QMark} that you actually wanted.

I have found the best tool to be the core module Unicode::UCD. This has many functions which can help you identify properties (as well as find a lot of other information). I've put together a script to showcase some of that module's functionality; I'd also recommend at least skimming the documentation to get a feel for the other functions that are available.

Much of the code in the script I'd probably just do from the command line; however, it seemed easier to lump it all together for the purposes of the current post.

#!/usr/bin/env perl

use 5.030;
use warnings;
use utf8;
use open OUT => qw{:encoding(UTF-8) :std};

use Data::Dump;
use Unicode::UCD qw{
    charprops_all charprop prop_aliases
};

say 'Find properties of interest for a quote:';
dd charprops_all(ord '"');

say '-' x 40;
say 'Find aliases for "Quotation_Mark":';
my @aliases = prop_aliases('Quotation_Mark');
say join "\n", @aliases;
say '-' x 40;

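# The list contains a literal comma, which would otherwise trigger
# the "Possible attempt to separate words with commas" warning.
no warnings 'qw';
my @chars
    = qw{" ' ` ~ X ‘ ’ ‚ ‛ , “ ” ‟ „ « » < >};
use warnings 'qw';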

say 'Check UCD vs. regex properties:';

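# For each alias, compare what Unicode::UCD reports
# with what the regex engine actually matches.
for my $prop (@aliases) {
    say '=' x 40;
    say "Property: $prop";
    say '=' x 40;

    for my $char (@chars) {
        # charprop() accepts a code point in "U+hex" form.
        my $cp = sprintf 'U+%04x', ord $char;
        say "Character:  $char";
        say "Code point: $cp";
        # UCD's answer for this binary property: "Yes" or "No".
        my $qmark_prop = charprop($cp, $prop);
        say "$prop: $qmark_prop";
        # The regex engine's answer, mapped to the same strings.
        my $re_prop
            = $char =~ /^\p{$prop}$/
            ? 'Yes' : 'No';
        say "Check regex: $re_prop";
        say 'UCD & RE match: ',
            $qmark_prop eq $re_prop
            ? 'Yes' : '!!! No !!!';
        say '-' x 40;
    }
}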

[Aside: Yes, <code> tags are generally preferred but, when the code contains Unicode characters, <pre> tags stop those characters from being turned into entity references (e.g. &#8219; in your OP). I've also bunched up the code somewhat as <pre> tags won't wrap the code around like you'd get with <code> tags.]

Notes:

Here's an extract of the output. The full output, which is rather long, is in the spoiler below.

Find properties of interest for a quote:
{
  ...
  Quotation_Mark               => "Yes",
  ...
}
----------------------------------------
Find aliases for "Quotation_Mark":
QMark
Quotation_Mark
----------------------------------------
Check UCD vs. regex properties:
========================================
Property: QMark
========================================
Character:  "
Code point: U+0022
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
...
----------------------------------------
Character:  X
Code point: U+0058
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character:  ‘
Code point: U+2018
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
...
========================================
Property: Quotation_Mark
========================================
... same as QMark ...
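If scanning the whole property dump by eye is too much of a slog, the hash reference that charprops_all() returns can be filtered instead. This is only a sketch of that idea: the /quot/i pattern and the variable names are my own choices, and the exact set of hits will depend on the Unicode version bundled with your Perl.

```perl
use strict;
use warnings;
use Unicode::UCD qw{charprops_all};

# All properties of U+0022 QUOTATION MARK, as a hash reference.
my $props = charprops_all(ord '"');

# Keep the entries whose property name or value mentions "quot",
# case-insensitively.
my @hits = sort grep {
    /quot/i || ($props->{$_} // '') =~ /quot/i
} keys %$props;

printf "%-16s => %s\n", $_, $props->{$_} for @hits;
```

Matching on the values as well as the names is deliberate: it also surfaces related properties such as Line_Break and Word_Break, whose values for this character mention quotes.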

Full output:

Find properties of interest for a quote:
{
  Age                          => "V1_1",
  Alphabetic                   => "No",
  ASCII_Hex_Digit              => "No",
  Bidi_Class                   => "Other_Neutral",
  Bidi_Control                 => "No",
  Bidi_Mirrored                => "No",
  Bidi_Mirroring_Glyph         => "",
  Bidi_Paired_Bracket          => "",
  Bidi_Paired_Bracket_Type     => "None",
  Block                        => "Basic_Latin",
  Canonical_Combining_Class    => "Not_Reordered",
  Case_Folding                 => "\"",
  Case_Ignorable               => "No",
  Cased                        => "No",
  Changes_When_Casefolded      => "No",
  Changes_When_Casemapped      => "No",
  Changes_When_Lowercased      => "No",
  Changes_When_NFKC_Casefolded => "No",
  Changes_When_Titlecased      => "No",
  Changes_When_Uppercased      => "No",
  Composition_Exclusion        => "No",
  Dash                         => "No",
  Decomposition_Mapping        => "\"",
  Decomposition_Type           => "None",
  Default_Ignorable_Code_Point => "No",
  Deprecated                   => "No",
  Diacritic                    => "No",
  East_Asian_Width             => "Narrow",
  Equivalent_Unified_Ideograph => "",
  Extender                     => "No",
  Full_Composition_Exclusion   => "No",
  General_Category             => "Other_Punctuation",
  Grapheme_Base                => "Yes",
  Grapheme_Cluster_Break       => "Other",
  Grapheme_Extend              => "No",
  Hangul_Syllable_Type         => "Not_Applicable",
  Hex_Digit                    => "No",
  Hyphen                       => "No",
  ID_Continue                  => "No",
  ID_Start                     => "No",
  Ideographic                  => "No",
  IDS_Binary_Operator          => "No",
  IDS_Trinary_Operator         => "No",
  Indic_Positional_Category    => "NA",
  Indic_Syllabic_Category      => "Other",
  ISO_Comment                  => "",
  Join_Control                 => "No",
  Joining_Group                => "No_Joining_Group",
  Joining_Type                 => "Non_Joining",
  Line_Break                   => "Quotation",
  Logical_Order_Exception      => "No",
  Lowercase                    => "No",
  Lowercase_Mapping            => "\"",
  Math                         => "No",
  Name                         => "QUOTATION MARK",
  Name_Alias                   => "",
  NFC_Quick_Check              => "Yes",
  NFD_Quick_Check              => "Yes",
  NFKC_Casefold                => "\"",
  NFKC_Quick_Check             => "Yes",
  NFKD_Quick_Check             => "Yes",
  Noncharacter_Code_Point      => "No",
  Numeric_Type                 => "None",
  Numeric_Value                => NaN,
  Pattern_Syntax               => "Yes",
  Pattern_White_Space          => "No",
  Prepended_Concatenation_Mark => "No",
  Present_In                   => 1.1,
  Quotation_Mark               => "Yes",
  Radical                      => "No",
  Regional_Indicator           => "No",
  Script                       => "Common",
  Script_Extensions            => "Common",
  Sentence_Break               => "Close",
  Sentence_Terminal            => "No",
  Simple_Case_Folding          => "\"",
  Simple_Lowercase_Mapping     => "\"",
  Simple_Titlecase_Mapping     => "\"",
  Simple_Uppercase_Mapping     => "\"",
  Soft_Dotted                  => "No",
  Terminal_Punctuation         => "No",
  Titlecase_Mapping            => "\"",
  Unicode_1_Name               => "",
  Unified_Ideograph            => "No",
  Uppercase                    => "No",
  Uppercase_Mapping            => "\"",
  Variation_Selector           => "No",
  Vertical_Orientation         => "Rotated",
  White_Space                  => "No",
  Word_Break                   => "Double_Quote",
  XID_Continue                 => "No",
  XID_Start                    => "No",
}
----------------------------------------
Find aliases for "Quotation_Mark":
QMark
Quotation_Mark
----------------------------------------
Check UCD vs. regex properties:
========================================
Property: QMark
========================================
Character:  "
Code point: U+0022
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  '
Code point: U+0027
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  `
Code point: U+0060
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character:  ~
Code point: U+007e
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character:  X
Code point: U+0058
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character:  ‘
Code point: U+2018
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  ’
Code point: U+2019
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  ‚
Code point: U+201a
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  ‛
Code point: U+201b
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  ,
Code point: U+002c
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character:  “
Code point: U+201c
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  ”
Code point: U+201d
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  ‟
Code point: U+201f
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  „
Code point: U+201e
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  «
Code point: U+00ab
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  »
Code point: U+00bb
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  <
Code point: U+003c
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character:  >
Code point: U+003e
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
========================================
Property: Quotation_Mark
========================================
Character:  "
Code point: U+0022
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  '
Code point: U+0027
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  `
Code point: U+0060
Quotation_Mark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character:  ~
Code point: U+007e
Quotation_Mark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character:  X
Code point: U+0058
Quotation_Mark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character:  ‘
Code point: U+2018
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  ’
Code point: U+2019
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  ‚
Code point: U+201a
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  ‛
Code point: U+201b
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  ,
Code point: U+002c
Quotation_Mark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character:  “
Code point: U+201c
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  ”
Code point: U+201d
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  ‟
Code point: U+201f
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  „
Code point: U+201e
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  «
Code point: U+00ab
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  »
Code point: U+00bb
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character:  <
Code point: U+003c
Quotation_Mark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character:  >
Code point: U+003e
Quotation_Mark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
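The characters tested above were hand-picked. To list every code point that has the property, Unicode::UCD's prop_invlist() can be used; it returns an inversion list, in which the even-indexed elements start ranges that are in the set and the odd-indexed elements start ranges that are not. The following is a rough sketch under that assumption, with variable names of my own choosing.

```perl
use strict;
use warnings;
use open OUT => qw{:encoding(UTF-8) :std};
use Unicode::UCD qw{prop_invlist charinfo};

# Inversion list for the binary property Quotation_Mark.
my @invlist = prop_invlist('Quotation_Mark');

# Expand the inversion list into individual code points.
my @code_points;
for (my $i = 0; $i < @invlist; $i += 2) {
    my $first = $invlist[$i];
    # A missing final element means the range runs to the last code point.
    my $last = $i + 1 < @invlist ? $invlist[$i + 1] - 1 : 0x10FFFF;
    push @code_points, $first .. $last;
}

# Show each code point with its character and name.
for my $cp (@code_points) {
    my $info = charinfo($cp);
    my $name = $info ? $info->{name} : '(no name available)';
    printf "U+%04X  %s  %s\n", $cp, chr $cp, $name;
}
```

The length of the list depends on the Unicode version your Perl ships with, so treat any count as specific to your installation.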

— Ken