How to detect non printable characters and non white space characters? [RESOLVED]

thanos1983 has asked for the wisdom of the Perl Monks concerning the following question:

Hello again Monks,

Lately I have been bombarding the forum with questions, but no matter how much I experiment with my code I cannot figure out the solutions to my problems, which is why I keep asking.

To the question: I have a hash of hashes with hundreds of values in each inner hash. I want to delete an element when its value contains neither a white space character nor a non-printable character. In other words, a value is kept only if it contains at least one of the two.

I tried the two conditions separately and they work fine, at least in my test examples. I need to combine them, because a value without spaces may still contain special characters.

Based on my research, non-printable characters can be detected with /[^[:print:]]/g and non-ASCII characters with /[^[:ascii:]]/, as described in "Finding out non ASCII Characters in the text".
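The two character classes are not interchangeable, which a small sketch can illustrate. The helper name `flags` and the sample strings are made up for this example. It relies on two facts: a tab is ASCII but not printable, and a character above U+00FF such as U+263A is printable under Unicode rules but not ASCII.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: report which of the two regexes flags a string.
sub flags {
    my ($text) = @_;
    return +{
        non_print => ($text =~ /[^[:print:]]/ ? 1 : 0),
        non_ascii => ($text =~ /[^[:ascii:]]/ ? 1 : 0),
    };
}

# Labels are printed instead of the raw strings, so no control or wide
# characters reach the terminal.
my %samples = (
    'plain WAP'       => 'WAP',
    'tab (U+0009)'    => "a\tb",
    'smiley (U+263A)' => 'a' . chr(0x263A),
);

for my $label (sort keys %samples) {
    my $f = flags($samples{$label});
    printf "%-16s non_print=%d non_ascii=%d\n",
        $label, $f->{non_print}, $f->{non_ascii};
}
```

So a check for "special" characters should pick the class that matches the intent: `[^[:print:]]` for control characters and the like, `[^[:ascii:]]` for anything outside the 7-bit range.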

Sample of code:

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

my $str = 'a[bdy]dfjaPÃ‘sdafÃœ';
my $str_2 = 'WAP';

my $hoh_ref = {
    hash_1 => { a => 'a[bdy]dfjaPÃ‘sdafÃœ',
        b => 'WAP' },
    hash_2 => { c => 'Te st' }
};

print Dumper $hoh_ref;

foreach my $key (sort keys %{$hoh_ref}) {
    foreach my $value (keys %{$$hoh_ref{$key}}) {
        # If not white space character or non printable character remove element
        if ($$hoh_ref{$key}{$value} !~ /[^[:print:]]/g ||
            $$hoh_ref{$key}{$value} !~ /\s/) {
            delete $$hoh_ref{$key}{$value};
        }
        elsif ($$hoh_ref{$key}{$value} =~ /[^[:print:]]/g) {
            while ($$hoh_ref{$key}{$value} =~ /[^[:print:]]/g) {
                print "Non Printable Characater:\t$&\n";
            }
        }
    }
}

print Dumper $hoh_ref;

__END__
$VAR1 = {
          'hash_2' => {
                        'c' => 'Te st'
                      },
          'hash_1' => {
                        'b' => 'WAP',
                        'a' => 'a[bdy]dfjaPÃ‘sdafÃœ'
                      }
        };
$VAR1 = {
          'hash_2' => {},
          'hash_1' => {}
        };

Desired output would be:

$VAR1 = {
          'hash_2' => {
                        'c' => 'Te st'
                      },
          'hash_1' => {
                        'a' => 'a[bdy]dfjaPÃ‘sdafÃœ'
                      }
        };

Thanks in advance for your time and effort.

Seeking for Perl wisdom...on the process of learning...not there...yet!


Replies are listed 'Best First'.
Re: How to detect non printable characters and non white space characters? by choroba (Cardinal) on Feb 17, 2017 at 10:22 UTC
It seems you got De Morgan wrong: it should be

    # If not white space character AND not non printable character remove element

That's probably because there are too many negations. If you have an `if` with `else`, using negation in the condition makes it really hard to understand. So, instead of

    if ($$hoh_ref{$key}{$value} !~ /[^[:print:]]/g &&
        $$hoh_ref{$key}{$value} !~ /\s/) {
        delete $$hoh_ref{$key}{$value};
    }
    elsif ($$hoh_ref{$key}{$value} =~ /[^[:print:]]/g) {
        while ($$hoh_ref{$key}{$value} =~ /[^[:print:]]/g) {
            print "Non Printable Characater:\t$&\n";
        }
    }

you can use a bit simpler

    if ($$hoh_ref{$key}{$value} =~ /[^[:print:]]/) {    # no /g needed
        while ($$hoh_ref{$key}{$value} =~ /[^[:print:]]/g) {
            print "Non Printable Characater:\t$&\n";
        }
    }
    elsif ($$hoh_ref{$key}{$value} !~ /\s/) {
        delete $$hoh_ref{$key}{$value};
    }

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
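To make the De Morgan step concrete, here is a minimal sketch that states the "keep" rule positively and derives the "delete" rule as its negation. The helper names `keep_value` and `delete_value` are hypothetical and only serve this illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: a value is kept when it contains a non-printable
# character OR a whitespace character.
sub keep_value {
    my ($text) = @_;
    return ($text =~ /[^[:print:]]/ || $text =~ /\s/) ? 1 : 0;
}

# De Morgan: NOT (A or B) is the same as (NOT A) and (NOT B).
# A value is deleted when it has no non-printable AND no whitespace.
sub delete_value {
    my ($text) = @_;
    return ($text !~ /[^[:print:]]/ && $text !~ /\s/) ? 1 : 0;
}

# Each sample is [label, text]; only the label is printed.
my @samples = (
    [ 'WAP'          => 'WAP' ],
    [ 'Te st'        => 'Te st' ],
    [ 'with BEL x07' => "a\x07b" ],
);

for my $sample (@samples) {
    my ($label, $text) = @{$sample};
    printf "%-14s keep=%d delete=%d\n",
        $label, keep_value($text), delete_value($text);
}
```

For every input exactly one of the two helpers returns 1, which is the property the original `||` version of the delete condition broke.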
Re^2: How to detect non printable characters and non white space characters? by thanos1983 (Parson) on Feb 17, 2017 at 11:03 UTC
Hello choroba, I did think about using and instead of or while implementing the solution, but it did not make sense to me and it did not work when I tested it. You are absolutely right: the simpler the better. Thank you for your time and effort reading and replying to my question.
Re^3: How to detect non printable characters and non white space characters? by Marshall (Canon) on Feb 17, 2017 at 12:28 UTC
I like the post from choroba++. A few minor suggestions:

I show a different indenting style below, but that is no big deal to me. More important to me is `$href->{key1}{key2}`: I think that reads easier than `$$href{key1}{key2}`. My preference is to use `$$scalar_ref` only to dereference a scalar value.

I replaced $& with a simple $1 capture. There can be some performance issues with $& with Perl < 5.20. See perlre. My preference is not to use $& unless needed, and here it is not.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Data::Dumper;

    my $str = 'a[bdy]dfjaPÃ‘sdafÃœ';
    my $str_2 = 'WAP';

    my $hoh_ref = {
        hash_1 => { a => 'a[bdy]dfjaPÃ‘sdafÃœ',
                    b => 'WAP' },
        hash_2 => { c => 'Te st' }
    };

    print "Original Hash:\n";
    print Dumper $hoh_ref;

    foreach my $main_key (sort keys %{$hoh_ref}) {
        foreach my $sub_key (keys %{$hoh_ref->{$main_key}}) {
            if ($hoh_ref->{$main_key}{$sub_key} =~ /[^[:print:]]/) {
                while ($hoh_ref->{$main_key}{$sub_key} =~ /([^[:print:]])/g) {
                    print "Non Printable Character:\t$1\n";
                }
            }
            elsif ($hoh_ref->{$main_key}{$sub_key} !~ /\s/) {
                delete $hoh_ref->{$main_key}{$sub_key};
            }
        }
    }

    print "\nResult Hash:\n";
    print Dumper $hoh_ref;

    __END__
    Original Hash:
    $VAR1 = {
              'hash_1' => {
                            'a' => 'a[bdy]dfjaPÃ‘sdafÃœ',
                            'b' => 'WAP'
                          },
              'hash_2' => {
                            'c' => 'Te st'
                          }
            };
    Non Printable Character:	Ã
    Non Printable Character:	‘
    Non Printable Character:	Ã
    Non Printable Character:	œ

    Result Hash:
    $VAR1 = {
              'hash_1' => {
                            'a' => 'a[bdy]dfjaPÃ‘sdafÃœ'
                          },
              'hash_2' => {
                            'c' => 'Te st'
                          }
            };

Update: re-indented the code above. I think the result is better now. My editor barfed when converting the OP's tabs to spaces.
Re: How to detect non printable characters and non white space characters? by Eily (Monsignor) on Feb 17, 2017 at 10:17 UTC
First, /g means "global": search for all the matches, don't stop at the first one (either one result at a time, or all of them at once). This technically doesn't change anything here, but you only need to find one non-printable character to keep the string, so the first regex could just be `/[^[:print:]]/`.

The straightforward solution to your problem is to use an and rather than an or in your condition. You want to delete a string that (only contains valid printable chars) AND (only contains valid non-space chars) => (does not contain non-printables) AND (does not contain spaces). De Morgan's laws might also be a good read on how to negate a condition with ORs and ANDs.

However, I think you can obtain something easier to read with unless (which is an "if not") and next, which stops processing the current element in a loop and goes on to the next one. Note that the regexes have to be bound to the value explicitly; a bare `/.../` would match against `$_`, which is not set in your loop.

    # Delete the value and try the next one unless it has a non-printable or a space
    my $text = $$hoh_ref{$key}{$value};
    unless ($text =~ /[^[:print:]]/ or $text =~ /\s/) {
        delete $$hoh_ref{$key}{$value};
        next;
    }
    while ($$hoh_ref{$key}{$value} =~ /([^[:print:]])/g) {
        print "Unprintable !\n";
    }
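The "all of them at once" case mentioned for /g can be sketched as follows: in list context, a /g match with a capture group returns every captured match in one go. The helper name `non_printables` and the "U+XXXX" label format are assumptions made for this example, chosen so that no raw control characters get printed.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return a printable label for every non-printable
# character found in $text.
sub non_printables {
    my ($text) = @_;
    # List context plus /g: all captures are returned at once.
    my @found = $text =~ /([^[:print:]])/g;
    return map { sprintf 'U+%04X', ord($_) } @found;
}

# BEL (0x07) and ESC (0x1B) embedded in otherwise printable text.
my @labels = non_printables("a\x07b\x1Bc");
print "Found: @labels\n";

# A string with only printable characters yields an empty list.
my @none = non_printables('WAP');
print "Count for WAP: ", scalar(@none), "\n";
```

For a plain yes/no test, as in the delete condition, the match without /g is enough; the list form is only useful when the offending characters themselves are of interest.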
Re^2: How to detect non printable characters and non white space characters? by thanos1983 (Parson) on Feb 17, 2017 at 11:09 UTC
Hello Eily, You are absolutely right, I should have used and instead of or. I thought about it when I was testing my code, but it was failing. The reason is straightforward: in my and condition I was using the white space condition, and since that condition matched, the check for non-printable characters was never even evaluated. Now that I think about it, it makes perfect sense. Thank you for your time reading and replying to my question.
Re: How to detect non printable characters and non white space characters? [\p{} character classes] by kcott (Archbishop) on Feb 17, 2017 at 23:25 UTC
G'day thanos1983,

I see your condition issue has been resolved.

I'm currently working on a private project in a similar area (in my case, it is intended to handle any Unicode® character). This uses the `\p{}` construct to determine character type. You may be interested in this approach.

I've posted test code below. Be aware that this is a work-in-progress: the code I've posted works fine but is incomplete. There are more detailed notes after the code.

    #!/usr/bin/env perl

    use 5.025009;
    use strict;
    use warnings;

    use utf8;
    use open IO => qw{:encoding(utf8) :std};
    use charnames ':full';

    BEGIN {
        my @types = qw{CONTROL PRINT COMBINE UNKNOWN};
        eval 'use enum @types';
        sub type_name { $types[$_[0]] }
    }

    use Test::More;

    my @tests = (
        [ '0000' => CONTROL ],
        [ '0009' => CONTROL ],
        [ '000a' => CONTROL ],
        [ '0020' => PRINT ],
        [ '0021' => PRINT ],
        [ '0030' => PRINT ],
        [ '0040' => PRINT ],
        [ '0041' => PRINT ],
        [ '0060' => PRINT ],
        [ '0061' => PRINT ],
        [ '007e' => PRINT ],
        [ '007f' => CONTROL ],
        [ '00a0' => PRINT ],
        [ '0300' => COMBINE ],
        [ '034f' => COMBINE ],
        [ '2000' => PRINT ],
        [ '200d' => CONTROL ],
        [ '2028' => CONTROL ],
        [ '2029' => CONTROL ],
        [ 'fe00' => CONTROL ],
        [ '1f3fb' => PRINT ],
        [ 'e0100' => CONTROL ],
        [ '10ffff' => CONTROL ],
    );

    plan tests => scalar @tests;

    for my $test (@tests) {
        my ($input, $exp) = $test->@*;

        is(type_of($input), $exp,
            "Checking '@{[sprintf q{%04X}, hex $input]}'"
            . " is of type '@{[type_name($exp)]}'"
            . " (Got: '@{[type_name(type_of($input))]}')."
        );
    }

    sub type_of {
        my ($input) = @_;

        my $char = chr hex $input;

        return CONTROL if $char =~ / [ \p{C} \p{Zl} \p{Zp} \p{VS} ] /xx;
        return PRINT   if $char =~ / [ \p{L} \p{N} \p{P} \p{S} \p{Zs} ] /xx;
        return COMBINE if $char =~ / [ \p{M} ] /xx;
        return UNKNOWN;
    }

Notes:

* I required the Perl (DEVELOPER RELEASE) version `5.25.9` because:
    * I wanted Unicode v9.0 support (actually `5.25.3` is sufficient for that, see http://search.cpan.org/~shay/perl-5.25.3/pod/perldelta.pod#Unicode_9.0_is_now_supported).
    * I like the double-x regex modifier (`//xx`). This allows whitespace in character classes, as well as the rest of the regex: in my opinion, this greatly improves readability. Be aware that this feature appears to have been added and removed at various times; for instance, it was removed in `5.25.1` (http://search.cpan.org/~xsawyerx/perl-5.25.1/pod/perldelta.pod#qr//xx_is_no_longer_permissible).
    * Purely pragmatic reasons: that's the only `5.25.x` version I currently have installed.
* The `utf8`, `open` and `charnames` pragmata are not needed for the code in its present state: those are for future development. And `strict` doesn't need to be explicitly stated: it's implicit with any `use VERSION` where `VERSION` >= `5.12.0`.
* enum is a CPAN module.
* perluniprops lists all the `\p{}` constructs I've used. The descriptions in http://www.unicode.org/reports/tr44/#Property_Values are more detailed.
* The types I've used should be mostly self-explanatory. In brief:
    * `CONTROL`: don't print the character as-is; print a representation of its code point.
    * `PRINT`: print the character.
    * `COMBINE`: a combining character; print in combination with another character (I typically use `U+25CC` for this, e.g. `◌̀` renders as `◌̀`).
    * `UNKNOWN`: traps anything I've failed to match.
* All tests pass. Here's an extract of the output:

    1..23
    ok 1 - Checking '0000' is of type 'CONTROL' (Got: 'CONTROL').
    ...
    ok 23 - Checking '10FFFF' is of type 'CONTROL' (Got: 'CONTROL').

See also Unicode::UCD. I found this useful for checking property values of individual characters.

-- Ken
Re^2: How to detect non printable characters and non white space characters? [\p{} character classes] by thanos1983 (Parson) on Feb 24, 2017 at 16:16 UTC
Hello kcott, Sorry for the late reply, I only just saw your post. This looks like a great idea (although a work in progress, as you said). The only problem is that my test bed does not have such a recent version of Perl: it runs `v5.22.1`, and I think some nodes run even lower versions :P due to old OS releases. Nevertheless, thanks again for your time and effort reading and replying to my question.
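For older perls such as the `v5.22.1` mentioned here, the same `\p{}` idea can be sketched without the `//xx` modifier, postfix dereferencing or the enum module. This is only an adaptation under stated assumptions, not kcott's code: the category names are returned as plain strings, and the result for recently assigned code points depends on the Unicode version bundled with the perl in use.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Classify one code point, given as a hex string, using the same \p{}
# properties as the reply above but written without //xx.
# Returning plain strings instead of enum constants is an assumption
# made to keep this sketch free of CPAN dependencies.
sub type_of {
    my ($hex) = @_;
    my $char = chr(hex($hex));

    return 'CONTROL' if $char =~ /[\p{C}\p{Zl}\p{Zp}\p{VS}]/;
    return 'PRINT'   if $char =~ /[\p{L}\p{N}\p{P}\p{S}\p{Zs}]/;
    return 'COMBINE' if $char =~ /\p{M}/;
    return 'UNKNOWN';
}

# Only code points that have been assigned for a long time are used
# here, so the result should not depend on the Unicode version.
for my $hex (qw{0009 0041 00a0 0300 2028 fe00}) {
    printf "U+%04X => %s\n", hex($hex), type_of($hex);
}
```

The order of the checks matters: variation selectors such as U+FE00 are also marks (`\p{M}`), so testing `\p{VS}` first is what puts them in the CONTROL group.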