in reply to regex quotes character class
G'day igoryonya,
"I think, I used to stumble upon a character class, or a property ... can't find any information about it now ... something like ... [[:quote:]] or \p{Quote} ..."
I suspect you were looking in one or both of perlrecharclass or perlrebackslash; however, neither [[:quote:]] nor \p{Quote} exists (as far as I can tell).
I see ++soonix has indicated \p{QMark} and \p{Quotation_Mark} in perluniprops.
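For what it's worth, here's a minimal sketch of how either of those might be used in a pattern. The sub names (strip_quotes, is_quoted) are made up for illustration and are not taken from your OP; \p{QMark} and \p{Quotation_Mark} are interchangeable here.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every character that has the
# Quotation_Mark property (QMark is its short alias).
sub strip_quotes {
    my ($text) = @_;
    return $text =~ s/\p{QMark}//gr;
}

# Hypothetical helper: true if the string both starts and ends
# with a character that has the Quotation_Mark property.
sub is_quoted {
    my ($text) = @_;
    return $text =~ /\A\p{Quotation_Mark}.*\p{Quotation_Mark}\z/s ? 1 : 0;
}
```

Note that is_quoted() only checks that both ends are quotation marks of some kind; it makes no attempt to check that they form a matching pair.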
On a number of occasions in the past, I have had to identify Unicode properties (which could be for spaces, combining characters, specific scripts, and so on). I've tried the table in the "perluniprops: Properties accessible through \p{} and \P{}" section but find it to be a hard slog; for instance, a case-insensitive search for "quote" does find some matches but not the \p{Quotation_Mark} or \p{QMark} that you actually wanted.
I have found the best tool to be the core module Unicode::UCD. It has many functions that can help you identify properties (as well as find a lot of other information). I've put together a script to showcase some of that module's functionality; I'd recommend at least skimming the documentation to get a feel for the other functions that are available.
I'd probably just run much of the code in the script from the command line; however, it seemed easier to lump it all together for the purposes of the current post.
#!/usr/bin/env perl
use 5.030;
use warnings;
use utf8;
use open OUT => qw{:encoding(UTF-8) :std};
use Data::Dump;
use Unicode::UCD qw{
    charprops_all charprop prop_aliases
};
say 'Find properties of interest for a quote:';
dd charprops_all(ord '"');
say '-' x 40;
say 'Find aliases for "Quotation_Mark":';
my @aliases = prop_aliases('Quotation_Mark');
say join "\n", @aliases;
say '-' x 40;
no warnings 'qw';
my @chars
    = qw{" ' ` ~ X ‘ ’ ‚ ‛ , “ ” ‟ „ « » < >};
use warnings 'qw';
say 'Check UCD vs. regex properties:';
for my $prop (@aliases) {
    say '=' x 40;
    say "Property: $prop";
    say '=' x 40;
    for my $char (@chars) {
        my $cp = sprintf 'U+%04x', ord $char;
        say "Character: $char";
        say "Code point: $cp";
        my $qmark_prop = charprop($cp, $prop);
        say "$prop: $qmark_prop";
        my $re_prop
            = $char =~ /^\p{$prop}$/
            ? 'Yes' : 'No';
        say "Check regex: $re_prop";
        say 'UCD & RE match: ',
            $qmark_prop eq $re_prop
            ? 'Yes' : '!!! No !!!';
        say '-' x 40;
    }
}
[Aside: Yes, <code> tags are generally preferred but, when the code contains Unicode characters, <pre> tags stop those characters from being turned into entity references (e.g. &#8219; in your OP). I've also bunched up the code somewhat as <pre> tags won't wrap the code around like you'd get with <code> tags.]
Notes:
Here's an extract of the output. The full output, which is rather long, is in the spoiler below.
Find properties of interest for a quote:
{
...
Quotation_Mark => "Yes",
...
}
----------------------------------------
Find aliases for "Quotation_Mark":
QMark
Quotation_Mark
----------------------------------------
Check UCD vs. regex properties:
========================================
Property: QMark
========================================
Character: "
Code point: U+0022
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
...
----------------------------------------
Character: X
Code point: U+0058
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character: ‘
Code point: U+2018
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
...
========================================
Property: Quotation_Mark
========================================
... same as QMark ...
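The charprops_all() dump is long, so here's a rough sketch of one way to cut it down to just the properties whose value is "Yes" for a given character. The sub name yes_props is my own invention for this example.

```perl
use strict;
use warnings;
use Unicode::UCD qw{charprops_all};

# Hypothetical helper: sorted names of all properties whose
# value is "Yes" for the given character.
sub yes_props {
    my ($char) = @_;
    my $props = charprops_all(ord $char);
    my @names = sort
        grep { ($props->{$_} // '') eq 'Yes' }
        keys %$props;
    return @names;
}
```

Something like `print "$_\n" for yes_props('"');` should then list Quotation_Mark along with the handful of other "Yes" entries seen in the dump.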
Full output:
Find properties of interest for a quote:
{
Age => "V1_1",
Alphabetic => "No",
ASCII_Hex_Digit => "No",
Bidi_Class => "Other_Neutral",
Bidi_Control => "No",
Bidi_Mirrored => "No",
Bidi_Mirroring_Glyph => "",
Bidi_Paired_Bracket => "",
Bidi_Paired_Bracket_Type => "None",
Block => "Basic_Latin",
Canonical_Combining_Class => "Not_Reordered",
Case_Folding => "\"",
Case_Ignorable => "No",
Cased => "No",
Changes_When_Casefolded => "No",
Changes_When_Casemapped => "No",
Changes_When_Lowercased => "No",
Changes_When_NFKC_Casefolded => "No",
Changes_When_Titlecased => "No",
Changes_When_Uppercased => "No",
Composition_Exclusion => "No",
Dash => "No",
Decomposition_Mapping => "\"",
Decomposition_Type => "None",
Default_Ignorable_Code_Point => "No",
Deprecated => "No",
Diacritic => "No",
East_Asian_Width => "Narrow",
Equivalent_Unified_Ideograph => "",
Extender => "No",
Full_Composition_Exclusion => "No",
General_Category => "Other_Punctuation",
Grapheme_Base => "Yes",
Grapheme_Cluster_Break => "Other",
Grapheme_Extend => "No",
Hangul_Syllable_Type => "Not_Applicable",
Hex_Digit => "No",
Hyphen => "No",
ID_Continue => "No",
ID_Start => "No",
Ideographic => "No",
IDS_Binary_Operator => "No",
IDS_Trinary_Operator => "No",
Indic_Positional_Category => "NA",
Indic_Syllabic_Category => "Other",
ISO_Comment => "",
Join_Control => "No",
Joining_Group => "No_Joining_Group",
Joining_Type => "Non_Joining",
Line_Break => "Quotation",
Logical_Order_Exception => "No",
Lowercase => "No",
Lowercase_Mapping => "\"",
Math => "No",
Name => "QUOTATION MARK",
Name_Alias => "",
NFC_Quick_Check => "Yes",
NFD_Quick_Check => "Yes",
NFKC_Casefold => "\"",
NFKC_Quick_Check => "Yes",
NFKD_Quick_Check => "Yes",
Noncharacter_Code_Point => "No",
Numeric_Type => "None",
Numeric_Value => NaN,
Pattern_Syntax => "Yes",
Pattern_White_Space => "No",
Prepended_Concatenation_Mark => "No",
Present_In => 1.1,
Quotation_Mark => "Yes",
Radical => "No",
Regional_Indicator => "No",
Script => "Common",
Script_Extensions => "Common",
Sentence_Break => "Close",
Sentence_Terminal => "No",
Simple_Case_Folding => "\"",
Simple_Lowercase_Mapping => "\"",
Simple_Titlecase_Mapping => "\"",
Simple_Uppercase_Mapping => "\"",
Soft_Dotted => "No",
Terminal_Punctuation => "No",
Titlecase_Mapping => "\"",
Unicode_1_Name => "",
Unified_Ideograph => "No",
Uppercase => "No",
Uppercase_Mapping => "\"",
Variation_Selector => "No",
Vertical_Orientation => "Rotated",
White_Space => "No",
Word_Break => "Double_Quote",
XID_Continue => "No",
XID_Start => "No",
}
----------------------------------------
Find aliases for "Quotation_Mark":
QMark
Quotation_Mark
----------------------------------------
Check UCD vs. regex properties:
========================================
Property: QMark
========================================
Character: "
Code point: U+0022
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: '
Code point: U+0027
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: `
Code point: U+0060
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character: ~
Code point: U+007e
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character: X
Code point: U+0058
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character: ‘
Code point: U+2018
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: ’
Code point: U+2019
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: ‚
Code point: U+201a
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: ‛
Code point: U+201b
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: ,
Code point: U+002c
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character: “
Code point: U+201c
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: ”
Code point: U+201d
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: ‟
Code point: U+201f
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: „
Code point: U+201e
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: «
Code point: U+00ab
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: »
Code point: U+00bb
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: <
Code point: U+003c
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character: >
Code point: U+003e
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
========================================
Property: Quotation_Mark
========================================
Character: "
Code point: U+0022
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: '
Code point: U+0027
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: `
Code point: U+0060
Quotation_Mark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character: ~
Code point: U+007e
Quotation_Mark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character: X
Code point: U+0058
Quotation_Mark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character: ‘
Code point: U+2018
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: ’
Code point: U+2019
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: ‚
Code point: U+201a
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: ‛
Code point: U+201b
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: ,
Code point: U+002c
Quotation_Mark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character: “
Code point: U+201c
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: ”
Code point: U+201d
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: ‟
Code point: U+201f
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: „
Code point: U+201e
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: «
Code point: U+00ab
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: »
Code point: U+00bb
Quotation_Mark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
Character: <
Code point: U+003c
Quotation_Mark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character: >
Code point: U+003e
Quotation_Mark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
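The characters tested above were hand-picked. If you'd rather see every code point that has the property, Unicode::UCD's prop_invlist() returns an inversion list that can be expanded; here's a minimal sketch of that. The sub name code_points_with is hypothetical, and capping an open-ended final range at U+10FFFF is my own assumption to keep the result finite.

```perl
use strict;
use warnings;
use Unicode::UCD qw{prop_invlist};

# Hypothetical helper: expand the inversion list for a binary
# property into the individual code points that have it.
sub code_points_with {
    my ($prop) = @_;
    my @invlist = prop_invlist($prop);
    my @cps;
    # Even indices start a range that has the property; the next
    # element (if any) is the first code point after that range.
    for (my $i = 0; $i < @invlist; $i += 2) {
        my $last = $i + 1 < @invlist
                 ? $invlist[$i + 1] - 1
                 : 0x10FFFF;    # assumed cap: highest Unicode code point
        push @cps, $invlist[$i] .. $last;
    }
    return @cps;
}
```

For example, `printf "U+%04X\n", $_ for code_points_with('QMark');` should print one code point per line; which ones you get will depend on the Unicode version bundled with your perl.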
— Ken