igoryonya has asked for the wisdom of the Perl Monks concerning the following question:
Something like [[:quote:]] or this: \p{Quote}
which would match any type of quote character from any locale or Unicode, i.e.: "'«»‛❝❞🙶🙷`
and whatever else is considered to be a quote.
Re: regex quotes character class
by soonix (Chancellor) on Jun 17, 2020 at 06:27 UTC | |
by ikegami (Patriarch) on Jun 17, 2020 at 14:26 UTC | |
\p{Quotation_Mark=Yes} and its alias \p{QMark=Yes}, which can be simplified to \p{Quotation_Mark} and \p{QMark}, do indeed track the similarly-named Unicode character property. Characters with this property include *most* of the characters listed in the linked table and *some* of the characters listed by the OP.
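To make the point above concrete, here is a minimal sketch (not part of the original reply) that checks a handful of code points mentioned in this thread against both spellings of the property. The helper name `is_quotation_mark` and the choice of sample code points are illustrative assumptions.

```perl
use strict;
use warnings;

# Sample code points from the thread: U+0022, U+00AB and U+201B are expected
# to have Quotation_Mark=Yes; U+275D and U+0060 are expected not to.
my @code_points = (0x22, 0xAB, 0x201B, 0x275D, 0x60);

# Hypothetical helper: true when the character matches the property.
# It also dies if the long name and its QMark alias ever disagree.
sub is_quotation_mark {
    my ($cp) = @_;
    my $char  = chr $cp;
    my $long  = $char =~ /\A\p{Quotation_Mark}\z/ ? 1 : 0;
    my $short = $char =~ /\A\p{QMark}\z/          ? 1 : 0;
    die sprintf("alias mismatch for U+%04X\n", $cp) if $long != $short;
    return $long;
}

printf "U+%04X %s\n", $_, is_quotation_mark($_) ? 'Yes' : 'No'
    for @code_points;
```

Code points are used instead of literal characters so the snippet does not depend on the source file's encoding.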
Re: regex quotes character class
by kcott (Archbishop) on Jun 17, 2020 at 11:11 UTC | |
G'day igoryonya,

"I think, I used to stumble upon a character class, or a property ... can't find any information about it now ... something like ... [[:quote:]] or \p{Quote} ..."

I suspect you were looking in one or both of perlrecharclass or perlrebackslash; however, neither [[:quote:]] nor \p{Quote} exists (as far as I can tell). I see ++soonix has indicated \p{QMark} and \p{Quotation_Mark} in perluniprops.

On a number of occasions in the past, I have had to identify Unicode properties (which could be for spaces, combining characters, specific scripts, and so on). I've tried the table in the "perluniprops: Properties accessible through \p{} and \P{}" section but find it to be a hard slog; for instance, a case-insensitive search for "quote" does find some matches but not the \p{Quotation_Mark} or \p{QMark} that you actually wanted.

I have found the best tool to be the core module Unicode::UCD. This has many functions which can help you identify properties (as well as find a lot of other information). I'd recommend at least skimming the documentation to get a feel for the other functions that are available.

I've put together a script to showcase some of that module's functionality. Much of the code in the script I'd probably just do from the command line; however, it seemed easier to lump it all together for the purposes of the current post.
#!/usr/bin/env perl
use 5.030;
use warnings;
use utf8;
use open OUT => qw{:encoding(UTF-8) :std};
use Data::Dump;
use Unicode::UCD qw{
    charprops_all charprop prop_aliases
};
say 'Find properties of interest for a quote:';
dd charprops_all(ord '"');
say '-' x 40;
say 'Find aliases for "Quotation_Mark":';
my @aliases = prop_aliases('Quotation_Mark');
say join "\n", @aliases;
say '-' x 40;
no warnings 'qw';
my @chars
    = qw{" ' ` ~ X ‛ , ‟ « » < >};
use warnings 'qw';
say 'Check UCD vs. regex properties:';
for my $prop (@aliases) {
    say '=' x 40;
    say "Property: $prop";
    say '=' x 40;
    for my $char (@chars) {
        my $cp = sprintf 'U+%04x', ord $char;
        say "Character: $char";
        say "Code point: $cp";
        my $qmark_prop = charprop($cp, $prop);
        say "$prop: $qmark_prop";
        my $re_prop
            = $char =~ /^\p{$prop}$/
            ? 'Yes' : 'No';
        say "Check regex: $re_prop";
        say 'UCD & RE match: ',
            $qmark_prop eq $re_prop
            ? 'Yes' : '!!! No !!!';
        say '-' x 40;
    }
}
[Aside: Yes, <code> tags are generally preferred but, when the code contains Unicode characters, <pre> tags stop those characters from being turned into entity references (e.g. ‛ in your OP). I've also bunched up the code somewhat as <pre> tags won't wrap the code around like you'd get with <code> tags.]

Notes:
Here's an extract of the output. The full output, which is rather long, is in the spoiler below.
Find properties of interest for a quote:
{
...
Quotation_Mark => "Yes",
...
}
----------------------------------------
Find aliases for "Quotation_Mark":
QMark
Quotation_Mark
----------------------------------------
Check UCD vs. regex properties:
========================================
Property: QMark
========================================
Character: "
Code point: U+0022
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
...
----------------------------------------
Character: X
Code point: U+0058
QMark: No
Check regex: No
UCD & RE match: Yes
----------------------------------------
Character: ‘
Code point: U+2018
QMark: Yes
Check regex: Yes
UCD & RE match: Yes
----------------------------------------
...
========================================
Property: Quotation_Mark
========================================
... same as QMark ...
-- Ken
Re: regex quotes character class
by ikegami (Patriarch) on Jun 17, 2020 at 11:46 UTC | |
You could use the following:
What follows explains how this was derived.

Let's start by collecting some info about each character.

for q in \" \' « » ‛ ❝ ❞ 🙶 🙷 \`; do
   uniprops --all --single -- "$q" >"props-$q"
done

The programs uniprops and unichars (used later) are provided by Unicode::Tussle.

Let's collect what we have.
The list is long, but a lot are redundant (aliases and short forms). [It would be nice if we could tell it to output just one form of equivalent forms!]
We consult perluniprops and see that no interesting property matches all 14. This is not surprising, since we have characters from two General Categories.
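For readers without Unicode::Tussle, the same kind of comparison can be roughly approximated with the core module Unicode::UCD. The sketch below is an assumption-laden alternative, not the method used in this post: `shared_properties` is a hypothetical helper that keeps only the properties whose value is identical for every code point passed in.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charprops_all);

# Hypothetical helper: return the properties (name => value) whose value
# is the same for every code point given.
sub shared_properties {
    my @code_points = @_;
    my %shared = %{ charprops_all(shift @code_points) };
    for my $cp (@code_points) {
        my $props = charprops_all($cp);
        for my $name (keys %shared) {
            my $want = $shared{$name}  // '';
            my $have = $props->{$name} // '';
            delete $shared{$name} if $want ne $have;
        }
    }
    return %shared;
}

# Three of the punctuation characters from the question.
my %common = shared_properties(0x22, 0xAB, 0x201B);
print "$_ = $common{$_}\n" for sort keys %common;
```

Passing a mix of punctuation and symbol characters should shrink the result, which is the effect described above: no interesting property survives for the whole set.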
The 9 punctuation characters all match \p{Quotation_Mark} aka \p{QMark}! This is the full set of quotation marks:
$ unichars -au '\p{QMark}' | cat
" U+00022 QUOTATION MARK
' U+00027 APOSTROPHE
« U+000AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
» U+000BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
‘ U+02018 LEFT SINGLE QUOTATION MARK
’ U+02019 RIGHT SINGLE QUOTATION MARK
‚ U+0201A SINGLE LOW-9 QUOTATION MARK
‛ U+0201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
“ U+0201C LEFT DOUBLE QUOTATION MARK
” U+0201D RIGHT DOUBLE QUOTATION MARK
„ U+0201E DOUBLE LOW-9 QUOTATION MARK
‟ U+0201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK
‹ U+02039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
› U+0203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
⹂ U+02E42 DOUBLE LOW-REVERSED-9 QUOTATION MARK
「 U+0300C LEFT CORNER BRACKET
」 U+0300D RIGHT CORNER BRACKET
『 U+0300E LEFT WHITE CORNER BRACKET
』 U+0300F RIGHT WHITE CORNER BRACKET
〝 U+0301D REVERSED DOUBLE PRIME QUOTATION MARK
〞 U+0301E DOUBLE PRIME QUOTATION MARK
〟 U+0301F LOW DOUBLE PRIME QUOTATION MARK
﹁ U+0FE41 PRESENTATION FORM FOR VERTICAL LEFT CORNER BRACKET
﹂ U+0FE42 PRESENTATION FORM FOR VERTICAL RIGHT CORNER BRACKET
﹃ U+0FE43 PRESENTATION FORM FOR VERTICAL LEFT WHITE CORNER BRACKET
﹄ U+0FE44 PRESENTATION FORM FOR VERTICAL RIGHT WHITE CORNER BRACKET
＂ U+0FF02 FULLWIDTH QUOTATION MARK
＇ U+0FF07 FULLWIDTH APOSTROPHE
｢ U+0FF62 HALFWIDTH LEFT CORNER BRACKET
｣ U+0FF63 HALFWIDTH RIGHT CORNER BRACKET
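A listing like the one above can also be produced with core modules only. This is a sketch under the assumption that `prop_invlist` from Unicode::UCD and `charnames::viacode` are available in your Perl; the helper name `code_points_with` is made up for illustration. The exact set depends on the Unicode version your Perl ships with.

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invlist);
use charnames ();

# Hypothetical helper: expand the inversion list of a binary property
# into individual code points. Even-indexed elements start a range that
# is in the set; odd-indexed elements start a range that is not.
sub code_points_with {
    my ($property) = @_;
    my @invlist = prop_invlist($property);
    my @code_points;
    for (my $i = 0; $i < @invlist; $i += 2) {
        my $start = $invlist[$i];
        # If there is no closing element, the range runs to the last code point.
        my $end = $i + 1 < @invlist ? $invlist[$i + 1] - 1 : 0x10FFFF;
        push @code_points, $start .. $end;
    }
    return @code_points;
}

for my $cp (code_points_with('Quotation_Mark')) {
    printf "U+%05X %s\n", $cp, charnames::viacode($cp) // '(unnamed)';
}
```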
The 5 symbol characters aren't in any useful category. The symbols are:
So you could use
Four of those you listed have "QUOTATION MARK" in their name, so why aren't they matched by \p{QMark}? Well, they're actually "QUOTATION MARK ORNAMENT". U+275D, U+275E, U+1F676 and U+1F677 are all dingbats (emojis from before emojis was a word, kinda). They're not meant for use in text. There are three more of these:
Finally, this table also points out two braille characters you might want to include.
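As a rough sketch of the approach described in this post (not necessarily the exact pattern originally posted), the property can be combined with the extra characters in a single bracketed class. The five added code points are the ones from the question that lack Quotation_Mark (U+0060, U+275D, U+275E, U+1F676, U+1F677); the names `$quote_re` and `count_quotes` are illustrative assumptions.

```perl
use strict;
use warnings;

# Quotation_Mark plus the five symbols from the question that lack the
# property: U+0060, U+275D, U+275E, U+1F676 and U+1F677.
my $quote_re = qr/[\p{Quotation_Mark}\x{60}\x{275D}\x{275E}\x{1F676}\x{1F677}]/;

# Hypothetical helper: count quote-like characters in a string of
# decoded characters (not raw UTF-8 bytes).
sub count_quotes {
    my ($text) = @_;
    my $count = () = $text =~ /$quote_re/g;
    return $count;
}

# Two angle quotes plus two backticks.
print count_quotes("say \x{AB}hi\x{BB} and `ls`"), "\n";    # prints 4
```

Further code points, such as the remaining ornaments or the braille characters mentioned above, could be appended to the class in the same way once you have decided they belong.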
Update: Added the first section (the summary/"tl;dr") and the last section about why the 5 aren't quotation marks. Added the suggestion for additions to the list. Small wording tweaks.