G'day thanos1983,

I see your condition issue has been resolved. I'm currently working on a private project in a similar area (in my case, it is intended to handle any Unicode® character). This uses the \p{} construct to determine character type. You may be interested in this approach.

I've posted test code below. Be aware that this is a work in progress: the code I've posted works fine but is incomplete. There are more detailed notes after the code.

#!/usr/bin/env perl

use 5.025009;
use strict;
use warnings;
use utf8;
use open IO => qw{:encoding(utf8) :std};
use charnames ':full';

BEGIN {
    my @types = qw{CONTROL PRINT COMBINE UNKNOWN};
    eval 'use enum @types';
    sub type_name { $types[$_[0]] }
}

use Test::More;

my @tests = (
    [ '0000' => CONTROL ],
    [ '0009' => CONTROL ],
    [ '000a' => CONTROL ],
    [ '0020' => PRINT ],
    [ '0021' => PRINT ],
    [ '0030' => PRINT ],
    [ '0040' => PRINT ],
    [ '0041' => PRINT ],
    [ '0060' => PRINT ],
    [ '0061' => PRINT ],
    [ '007e' => PRINT ],
    [ '007f' => CONTROL ],
    [ '00a0' => PRINT ],
    [ '0300' => COMBINE ],
    [ '034f' => COMBINE ],
    [ '2000' => PRINT ],
    [ '200d' => CONTROL ],
    [ '2028' => CONTROL ],
    [ '2029' => CONTROL ],
    [ 'fe00' => CONTROL ],
    [ '1f3fb' => PRINT ],
    [ 'e0100' => CONTROL ],
    [ '10ffff' => CONTROL ],
);

plan tests => scalar @tests;

for my $test (@tests) {
    my ($input, $exp) = $test->@*;

    is(type_of($input), $exp,
        "Checking '@{[sprintf q{%04X}, hex $input]}'"
        . " is of type '@{[type_name($exp)]}'"
        . " (Got: '@{[type_name(type_of($input))]}')."
    );
}

sub type_of {
    my ($input) = @_;

    my $char = chr hex $input;

    return CONTROL  if $char =~ / [ \p{C} \p{Zl} \p{Zp} \p{VS} ] /xx;

    return PRINT    if $char =~ / [ \p{L} \p{N} \p{P} \p{S} \p{Zs} ] /xx;

    return COMBINE  if $char =~ / [ \p{M} ] /xx;

    return UNKNOWN;
}

Notes:

I required the Perl (** DEVELOPER RELEASE **) version 5.25.9 because:
- I wanted Unicode v9.0 support (actually 5.25.3 is sufficient for that, see http://search.cpan.org/~shay/perl-5.25.3/pod/perldelta.pod#Unicode_9.0_is_now_supported).
- I like the double-x regex modifier (//xx). This allows whitespace in character classes, as well as the rest of the regex: in my opinion, this greatly improves readability. Be aware that this feature appears to have been added and removed at various times; for instance, it was removed in 5.25.1 (http://search.cpan.org/~xsawyerx/perl-5.25.1/pod/perldelta.pod#qr//xx_is_no_longer_permissible).
- Purely pragmatic reasons: that's the only 5.25.x version I currently have installed.
The utf8, open and charnames pragmata are not needed for the code in its present state: those are for future development. And strict doesn't need to be explicitly stated: it's implicit with any use VERSION where VERSION >= 5.12.0.
enum is a CPAN module.
perluniprops lists all the \p{} constructs I've used. The descriptions in http://www.unicode.org/reports/tr44/#Property_Values are more detailed.
The types I've used should be mostly self-explanatory. In brief:
- CONTROL: don't print the character as-is; print a representation of its code point.
- PRINT: print the character.
- COMBINE: a combining character; print in combination with another character (I typically use U+25CC for this, e.g. U+25CC followed by U+0300 renders as ◌̀).
- UNKNOWN: traps anything I've failed to match.
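
As a rough illustration of how those four types might drive output, here is a minimal sketch. The render() function and the exact output formats (such as 'U+0009' for a control character) are my own assumptions for this example and are not part of the code posted above. It uses the same \p{} classes, written without //xx so that it also runs on older perls.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the posted code): map a code point, given as
# a hex string like the test data, to something safe to display.
sub render {
    my ($hex) = @_;
    my $char = chr hex $hex;

    # CONTROL: show a representation of the code point, not the character.
    return sprintf 'U+%04X', hex $hex
        if $char =~ /[\p{C}\p{Zl}\p{Zp}\p{VS}]/;

    # PRINT: the character can be shown as-is.
    return $char
        if $char =~ /[\p{L}\p{N}\p{P}\p{S}\p{Zs}]/;

    # COMBINE: attach the mark to U+25CC (DOTTED CIRCLE) as a base.
    return "\x{25CC}$char"
        if $char =~ /\p{M}/;

    # UNKNOWN: anything the classes above failed to match.
    return sprintf '<unknown U+%04X>', hex $hex;
}

binmode STDOUT, ':encoding(UTF-8)';
print render($_), "\n" for qw{0009 0041 0300 fe00};
```

The order of the checks matters here, just as it does in type_of(): U+FE00 is both a variation selector and a mark, so it has to be caught by the CONTROL test before the COMBINE test is reached.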

All tests pass. Here's an extract of the output:

1..23
ok 1 - Checking '0000' is of type 'CONTROL' (Got: 'CONTROL').
...
ok 23 - Checking '10FFFF' is of type 'CONTROL' (Got: 'CONTROL').

See also Unicode::UCD. I found this useful for checking property values of individual characters.
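
For example, a check of a character's General_Category with Unicode::UCD might look something like the sketch below. The general_category() wrapper is a hypothetical name I've made up for illustration; charinfo() is the module's own function. As I understand the documentation, charinfo() returns undef for unassigned code points and noncharacters, so the sketch reports those as 'Cn'.

```perl
use strict;
use warnings;
use Unicode::UCD qw{charinfo};

# Hypothetical wrapper: return the General_Category abbreviation
# (e.g. 'Lu', 'Mn', 'Cf') for a code point given as a hex string.
sub general_category {
    my ($hex) = @_;

    my $info = charinfo(hex $hex);

    # Assumption: an undef result means unassigned or noncharacter;
    # report that as 'Cn' (Unassigned).
    return defined $info ? $info->{category} : 'Cn';
}

printf "U+%04X => %s\n", hex $_, general_category($_)
    for qw{0041 0300 200d 2028};
```

The two-letter values correspond to the one-letter \p{} classes used in type_of(): 'Mn' falls under \p{M}, 'Cf' under \p{C}, and so on.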

— Ken

In reply to Re: How to detect non printable characters and non white space characters? [\p{} character classes] by kcott
in thread How to detect non printable characters and non white space characters? [RESOLVED] by thanos1983
