Re: regex quotes character class

G'day igoryonya,

"I think, I used to stumble upon a character class, or a property ... can't find any information about it now ... something like ... [[:quote:]] or \p{Quote} ..."

I suspect you were looking in one or both of perlrecharclass and perlrebackslash; however, neither [[:quote:]] nor \p{Quote} exists (as far as I can tell).

I see ++soonix has indicated \p{QMark} and \p{Quotation_Mark} in perluniprops.
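
For reference, here is a minimal sketch of how either name might be used in a regex. The sample string and the variable names ($text, $count, $stripped) are made up for illustration. Note that the apostrophe (U+0027) also has this property, so contractions are affected too.

```perl
use 5.030;
use warnings;
use utf8;
use open OUT => qw{:encoding(UTF-8) :std};

# Hypothetical sample text with several kinds of quotation marks.
my $text = q{He said “hello”, she said «salut», and I said "g'day".};

# Count every character that has the Quotation_Mark property.
my $count = () = $text =~ /\p{Quotation_Mark}/g;
say "Quotation marks found: $count";    # 7 (the apostrophe counts)

# Remove them, using the short alias; /r returns the modified copy.
my $stripped = $text =~ s/\p{QMark}//gr;
say $stripped;
```

If apostrophes should be kept, one option is to exclude them explicitly, for example with /(?!')\p{QMark}/.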

On a number of occasions in the past, I have had to identify Unicode properties (which could be for spaces, combining characters, specific scripts, and so on). I've tried the table in the "perluniprops: Properties accessible through \p{} and \P{}" section but find it to be a hard slog; for instance, a case-insensitive search for "quote" does find some matches but not the \p{Quotation_Mark} or \p{QMark} that you actually wanted.

I have found the best tool to be the core module Unicode::UCD. This has many functions which can help you identify properties (as well as find a lot of other information). I've put together a script to showcase some of that module's functionality; I'd also recommend at least skimming the documentation to get a feel for the other functions that are available.

Much of the code in the script I'd probably just do from the command line; however, it seemed easier to lump it all together for the purposes of the current post.

#!/usr/bin/env perl

use 5.030;
use warnings;
use utf8;
use open OUT => qw{:encoding(UTF-8) :std};

use Data::Dump;
use Unicode::UCD qw{
    charprops_all charprop prop_aliases
};

say 'Find properties of interest for a quote:';
dd charprops_all(ord '"');

say '-' x 40;
say 'Find aliases for "Quotation_Mark":';
my @aliases = prop_aliases('Quotation_Mark');
say join "\n", @aliases;
say '-' x 40;

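# The list contains a literal comma, which would otherwise trigger
# the "Possible attempt to separate words with commas" warning.
no warnings 'qw';
my @chars
    = qw{" ' ` ~ X ‘ ’ ‚ ‛ , “ ” ‟ „ « » < >};
use warnings 'qw';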

say 'Check UCD vs. regex properties:';

for my $prop (@aliases) {
    say '=' x 40;
    say "Property: $prop";
    say '=' x 40;

    for my $char (@chars) {
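        # charprop() accepts a code point as "U+" followed by hex digits.
        my $cp = sprintf 'U+%04X', ord $char;
        say "Character:  $char";
        say "Code point: $cp";
        # For a binary property, charprop() returns "Yes" or "No".
        my $qmark_prop = charprop($cp, $prop);
        say "$prop: $qmark_prop";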
        my $re_prop
            = $char =~ /^\p{$prop}$/
            ? 'Yes' : 'No';
        say "Check regex: $re_prop";
        say 'UCD & RE match: ',
            $qmark_prop eq $re_prop
            ? 'Yes' : '!!! No !!!';
        say '-' x 40;
    }
}

[Aside: Yes, <code> tags are generally preferred but, when the code contains Unicode characters, <pre> tags stop those characters from being turned into entity references (e.g. ‛ appearing as &#8219; in your OP). I've also bunched up the code somewhat as <pre> tags won't wrap the code around like you'd get with <code> tags.]

Notes:

Note the "use 5.030;" at the start. Versions of Perl typically seem to keep up with Unicode versions (some are just one version behind). Perl v5.30 supports the current Unicode v12.1 (see "perl5300delta: Unicode 12.1 is supported"). Unicode has a "BETA Unicode® 13.0.0" and the development Perl v5.31.9 supports that (see "perldelta (5.31.9): Unicode 13.0 (beta) is supported").
The list of characters in @chars is somewhat arbitrary. It includes a number of non-quotes for testing; there's also a comma which, at least to me, appears identical to '‚' (U+201A SINGLE LOW-9 QUOTATION MARK).
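
On the first note: if you're unsure which Unicode version a given perl supports, Unicode::UCD can report it. This is just a small sketch; the variable name is mine and the value printed depends on your installation.

```perl
use 5.030;
use warnings;
use Unicode::UCD;

# UnicodeVersion() returns the version of the Unicode Character Database
# that this perl was built with, as a dotted string.
my $unicode_version = Unicode::UCD::UnicodeVersion();
say "perl $^V uses Unicode $unicode_version";
```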

Here's an extract of the output from the Unicode::UCD script; the full output is rather long.

Find properties of interest for a quote:
{
  ...
  Quotation_Mark               => "Yes",
  ...
}
----------------------------------------
Find aliases for "Quotation_Mark":
QMark
Quotation_Mark
----------------------------------------
Check UCD vs. regex properties:
========================================
Property: QMark
========================================
Character:  "
Code point: U+0022
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
...
----------------------------------------
Character:  X
Code point: U+0058
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character:  ‘
Code point: U+2018
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
...
========================================
Property: Quotation_Mark
========================================
... same as QMark ...
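
If you want to see every character that has the property, rather than testing a hand-picked list, something like the following sketch may help. It assumes prop_invlist() and charinfo() from Unicode::UCD; the variable names (@invlist, @quotes) are just for illustration, and the exact list printed depends on the Unicode version of your perl.

```perl
use 5.030;
use warnings;
use open OUT => qw{:encoding(UTF-8) :std};

use Unicode::UCD qw{prop_invlist charinfo};

# prop_invlist() returns an inversion list: elements at even indices
# start a range of code points that have the property; elements at odd
# indices are the first code point after that range.
my @invlist = prop_invlist('Quotation_Mark');

my @quotes;
for (my $i = 0; $i < @invlist; $i += 2) {
    my $start = $invlist[$i];
    # A missing end marker would mean the range runs to the last code point.
    my $end = $i + 1 < @invlist ? $invlist[$i + 1] - 1 : 0x10FFFF;
    push @quotes, $start .. $end;
}

for my $cp (@quotes) {
    # charinfo() returns undef for unassigned code points, so guard the lookup.
    my $info = charinfo($cp);
    my $name = $info ? $info->{name} : '(unknown)';
    printf "U+%04X  %s  %s\n", $cp, chr $cp, $name;
}
```

The resulting list could be used to build an explicit character class if you need to exclude some of the matches (the apostrophe, say) from \p{QMark}.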

— Ken
