thanos1983 has asked for the wisdom of the Perl Monks concerning the following question:

Hello again Monks,

Lately I have been bombarding the forum with questions, but no matter how much I experiment with my code I cannot figure out the solution(s) to my problems. This is why I keep asking questions over and over.

To the question: I have a hash of hashes with hundreds of values in each inner hash. I want to apply a negative condition that checks for both whitespace characters and non-printable characters: if a value contains neither a whitespace character nor a non-printable character, I want to delete that element from the hash.

I tried the conditions separately and they work just fine, at least in my experiments. I need to combine them in one if statement, because a value with no spaces may still contain special characters.

Based on my research, to detect non-printable characters you can use either the regex /[^[:print:]]/g or /[^[:ascii:]]/, found here (Finding out non ASCII Characters in the text).
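The two classes are not interchangeable, which a small sketch can show. The helper names has_nonprint and has_nonascii are made up for this illustration, and the sample strings are assumptions chosen to separate the two cases: a TAB is ASCII but not printable, while U+263A is printable but not ASCII.

```perl
use strict;
use warnings;

# Hypothetical helpers, named for this sketch only.
sub has_nonprint { return $_[0] =~ /[^[:print:]]/ ? 1 : 0 }
sub has_nonascii { return $_[0] =~ /[^[:ascii:]]/ ? 1 : 0 }

my %sample = (
    tab    => "a\tb",          # TAB is ASCII, but it is a control character
    smiley => "a\x{263A}b",    # U+263A is printable, but not ASCII
    plain  => 'WAP',           # printable ASCII only
);

for my $name (sort keys %sample) {
    printf "%-6s nonprint=%d nonascii=%d\n",
        $name, has_nonprint($sample{$name}), has_nonascii($sample{$name});
}
```

Assuming the script is saved as UTF-8 and does not use utf8, a literal such as 'Ñ' is seen as two raw bytes, and each of those bytes counts as both non-printable and non-ASCII.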

Sample of code:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Data::Dumper;

    my $str = 'a[bdy]dfjaPÑsdafÜ';
    my $str_2 = 'WAP';

    my $hoh_ref = {
        hash_1 => { a => 'a[bdy]dfjaPÑsdafÜ', b => 'WAP' },
        hash_2 => { c => 'Te st' }
    };

    print Dumper $hoh_ref;

    foreach my $key (sort keys %{$hoh_ref}) {
        foreach my $value (keys %{$$hoh_ref{$key}}) {
            # If not white space character or non printable character remove element
            if ($$hoh_ref{$key}{$value} !~ /[^[:print:]]/g || $$hoh_ref{$key}{$value} !~ /\s/) {
                delete $$hoh_ref{$key}{$value};
            }
            elsif ($$hoh_ref{$key}{$value} =~ /[^[:print:]]/g) {
                while ($$hoh_ref{$key}{$value} =~ /[^[:print:]]/g) {
                    print "Non Printable Characater:\t$&\n";
                }
            }
        }
    }

    print Dumper $hoh_ref;

    __END__

    $VAR1 = {
              'hash_2' => {
                            'c' => 'Te st'
                          },
              'hash_1' => {
                            'b' => 'WAP',
                            'a' => 'a[bdy]dfjaPÑsdafÜ'
                          }
            };
    $VAR1 = {
              'hash_2' => {},
              'hash_1' => {}
            };

Desired output would be:

    $VAR1 = {
              'hash_2' => {
                            'c' => 'Te st'
                          },
              'hash_1' => {
                            'a' => 'a[bdy]dfjaPÑsdafÜ'
                          }
            };

Thanks in advance for your time and effort.

Seeking for Perl wisdom...on the process of learning...not there...yet!

Re: How to detect non printable characters and non white space characters?
by choroba (Cardinal) on Feb 17, 2017 at 10:22 UTC
    It seems you got De Morgan wrong: it should be
    # If not white space character AND not non printable character remove element

    That's probably because there are too many negations. If you have an if with an else, using negation in the condition makes it really hard to understand. So, instead of

        if ($$hoh_ref{$key}{$value} !~ /[^[:print:]]/g && $$hoh_ref{$key}{$value} !~ /\s/) {
            delete $$hoh_ref{$key}{$value};
        }
        elsif ($$hoh_ref{$key}{$value} =~ /[^[:print:]]/g) {
            while ($$hoh_ref{$key}{$value} =~ /[^[:print:]]/g) {
                print "Non Printable Characater:\t$&\n";
            }
        }

    you can use a bit simpler

        if ($$hoh_ref{$key}{$value} =~ /[^[:print:]]/) {    # no /g needed
            while ($$hoh_ref{$key}{$value} =~ /[^[:print:]]/g) {
                print "Non Printable Characater:\t$&\n";
            }
        }
        elsif ($$hoh_ref{$key}{$value} !~ /\s/) {
            delete $$hoh_ref{$key}{$value};
        }
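    Why dropping /g from the test matters can be seen in a minimal sketch. In scalar context a successful m//g leaves pos() set on the string, so a following /g loop on the same string resumes after the first hit. The report sub and its $use_g flag are hypothetical names for this illustration, and the control characters in the sample string are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the hex codes of the non-printable
# characters that the reporting loop sees.
sub report {
    my ($str, $use_g) = @_;
    my @hits;

    # The guard test, with or without /g, evaluated in scalar context.
    my $found = $use_g ? scalar($str =~ /[^[:print:]]/g)
                       : scalar($str =~ /[^[:print:]]/);

    if ($found) {
        # This loop starts at pos($str), which the /g guard has advanced.
        while ($str =~ /([^[:print:]])/g) {
            push @hits, sprintf '%02X', ord $1;
        }
    }
    return @hits;
}

my $sample = "a\x01b\x02c";
print "guard with /g:    @{[ report($sample, 1) ]}\n";   # prints: 02
print "guard without /g: @{[ report($sample, 0) ]}\n";   # prints: 01 02
```

    With /g in the guard, the first non-printable character is consumed by the test itself and never reaches the loop.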

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Hello choroba,

      While implementing the solution I did think about using and instead of or, but it did not make sense to me at the time and it did not work when I tested it.

      You are absolutely right: the simpler, the better. Thank you for your time and effort reading and replying to my question.

      Seeking for Perl wisdom...on the process of learning...not there...yet!
        I like the post from choroba++. A few minor suggestions:
        • I show a different indenting style below, but that is no big deal to me.
        • More important to me is: $href->{key1}{key2}. I think that reads more easily than $$href{key1}{key2}. My preference is to use $$scalar_ref only to de-reference a scalar value.
        • I replaced $& with a simple $1 capture. There can be some performance issues with $& with Perl < 5.20. See perlre. My preference is not to use $& unless needed, and here it is not.
            #!/usr/bin/perl
            use warnings;
            use strict;
            use Data::Dumper;

            my $str = 'a[bdy]dfjaPÑsdafÜ';
            my $str_2 = 'WAP';

            my $hoh_ref = {
                hash_1 => { a => 'a[bdy]dfjaPÑsdafÜ', b => 'WAP' },
                hash_2 => { c => 'Te st' }
            };

            print "Original Hash:\n";
            print Dumper $hoh_ref;

            foreach my $main_key (sort keys %{$hoh_ref}) {
                foreach my $sub_key (keys %{$hoh_ref->{$main_key}}) {
                    if ($hoh_ref->{$main_key}{$sub_key} =~ /[^[:print:]]/) {
                        while ($hoh_ref->{$main_key}{$sub_key} =~ /([^[:print:]])/g) {
                            print "Non Printable Character:\t$1\n";
                        }
                    }
                    elsif ($hoh_ref->{$main_key}{$sub_key} !~ /\s/) {
                        delete $hoh_ref->{$main_key}{$sub_key};
                    }
                }
            }

            print "\nResult Hash:\n";
            print Dumper $hoh_ref;

            __END__

            Original Hash:
            $VAR1 = {
                      'hash_1' => {
                                    'a' => 'a[bdy]dfjaPÑsdafÜ',
                                    'b' => 'WAP'
                                  },
                      'hash_2' => {
                                    'c' => 'Te st'
                                  }
                    };
            Non Printable Character: à
            Non Printable Character: ‘
            Non Printable Character: à
            Non Printable Character: œ

            Result Hash:
            $VAR1 = {
                      'hash_1' => {
                                    'a' => 'a[bdy]dfjaPÑsdafÜ'
                                  },
                      'hash_2' => {
                                    'c' => 'Te st'
                                  }
                    };
        Update: re-indented the code above. I think the result is better now. My editor barfed when converting the OP's tabs to spaces.
Re: How to detect non printable characters and non white space characters?
by Eily (Monsignor) on Feb 17, 2017 at 10:17 UTC

    First, /g means "global": keep searching for every match rather than stopping at the first one (either one result at a time, or all of them at once). You only need to find that there is one non-printable character to keep the string, so /g is not needed in the test. Note that in scalar context it also leaves pos() set on the string, so a later /g match on the same value resumes after that first hit. So the first regex could just be /[^[:print:]]/.

    The straightforward solution to your problem is to use an and rather than an or in your condition. The string you want to delete is one that (only contains printable chars) AND (only contains non-space chars) => (does not contain non-printables) AND (does not contain spaces). De Morgan's laws might also be a good read on how to negate a condition with ORs and ANDs.
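    As a minimal sketch of that rule, the three subs below compare the "keep" test, its correct negation, and the or-based negation from the original post. The sub names are hypothetical, and the accented letters are written as explicit UTF-8 bytes (an assumption about how the original script saw them) so the result does not depend on the file encoding.

```perl
use strict;
use warnings;

my %value = (
    a => "a[bdy]dfjaP\xC3\x91sdaf\xC3\x9C",   # contains non-printable bytes
    b => 'WAP',                               # neither space nor non-printable
    c => 'Te st',                             # contains a space
);

# Keep when the value has a non-printable OR a whitespace character.
sub keep     { my $v = shift; ($v =~ /[^[:print:]]/ || $v =~ /\s/) ? 1 : 0 }

# De Morgan: not (A or B) is (not A) and (not B).
sub drop_and { my $v = shift; ($v !~ /[^[:print:]]/ && $v !~ /\s/) ? 1 : 0 }

# The condition from the original post: (not A) or (not B).
sub drop_or  { my $v = shift; ($v !~ /[^[:print:]]/ || $v !~ /\s/) ? 1 : 0 }

for my $k (sort keys %value) {
    printf "%s: keep=%d drop_and=%d drop_or=%d\n",
        $k, keep($value{$k}), drop_and($value{$k}), drop_or($value{$k});
}
# a: keep=1 drop_and=0 drop_or=1
# b: keep=0 drop_and=1 drop_or=1
# c: keep=1 drop_and=0 drop_or=1
```

    The drop_or column is 1 for every value, which matches the two empty hashes in the original output.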

    However, I think you can obtain something easier to read with unless (which is an "if not") and next which will stop processing the current element in a loop, and go on to the next.

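        # Delete the value and try the next one unless it has a non-printable, or a space.
        # The matches are bound to the value explicitly; a bare /.../ would test $_ instead.
        delete $$hoh_ref{$key}{$value} and next
            unless $$hoh_ref{$key}{$value} =~ /[^[:print:]]/
                or $$hoh_ref{$key}{$value} =~ /\s/;

        while ($$hoh_ref{$key}{$value} =~ /([^[:print:]])/g) {
            print "Unprintable !\n";
        }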

      Hello Eily

      You are absolutely right, I should have used and instead of or. I thought about it when I was testing my code, but it was failing.

      The reason is straightforward: in my and condition I was using the whitespace test, and since that test matched, the check for non-printable characters was never evaluated. Now that I think about it, it makes perfect sense.

      Thank you for your time reading and replying to my question.

Re: How to detect non printable characters and non white space characters? [\p{} character classes]
by kcott (Archbishop) on Feb 17, 2017 at 23:25 UTC

    G'day thanos1983,

    I see your condition issue has been resolved. I'm currently working on a private project in a similar area (in my case, it is intended to handle any Unicode® character). This uses the \p{} construct to determine character type. You may be interested in this approach.

    I've posted test code below. Be aware that this is a work-in-progress: the code I've posted works fine but is incomplete. There are more detailed notes after the code.

        #!/usr/bin/env perl

        use 5.025009;
        use strict;
        use warnings;
        use utf8;
        use open IO => qw{:encoding(utf8) :std};
        use charnames ':full';

        BEGIN {
            my @types = qw{CONTROL PRINT COMBINE UNKNOWN};
            eval 'use enum @types';
            sub type_name { $types[$_[0]] }
        }

        use Test::More;

        my @tests = (
            [ '0000' => CONTROL ],
            [ '0009' => CONTROL ],
            [ '000a' => CONTROL ],
            [ '0020' => PRINT ],
            [ '0021' => PRINT ],
            [ '0030' => PRINT ],
            [ '0040' => PRINT ],
            [ '0041' => PRINT ],
            [ '0060' => PRINT ],
            [ '0061' => PRINT ],
            [ '007e' => PRINT ],
            [ '007f' => CONTROL ],
            [ '00a0' => PRINT ],
            [ '0300' => COMBINE ],
            [ '034f' => COMBINE ],
            [ '2000' => PRINT ],
            [ '200d' => CONTROL ],
            [ '2028' => CONTROL ],
            [ '2029' => CONTROL ],
            [ 'fe00' => CONTROL ],
            [ '1f3fb' => PRINT ],
            [ 'e0100' => CONTROL ],
            [ '10ffff' => CONTROL ],
        );

        plan tests => scalar @tests;

        for my $test (@tests) {
            my ($input, $exp) = $test->@*;

            is(type_of($input), $exp,
                "Checking '@{[sprintf q{%04X}, hex $input]}'"
                . " is of type '@{[type_name($exp)]}'"
                . " (Got: '@{[type_name(type_of($input))]}')."
            );
        }

        sub type_of {
            my ($input) = @_;

            my $char = chr hex $input;

            return CONTROL if $char =~ / [ \p{C} \p{Zl} \p{Zp} \p{VS} ] /xx;
            return PRINT   if $char =~ / [ \p{L} \p{N} \p{P} \p{S} \p{Zs} ] /xx;
            return COMBINE if $char =~ / [ \p{M} ] /xx;
            return UNKNOWN;
        }

    Notes:

    • I required the Perl (** DEVELOPER RELEASE **) version 5.25.9 because:
    • The utf8, open and charnames pragmata are not needed for the code in its present state: those are for future development. And strict doesn't need to be explicitly stated: it's implicit with any use VERSION where VERSION >= 5.12.0.
    • enum is a CPAN module.
    • perluniprops lists all the \p{} constructs I've used. The descriptions in http://www.unicode.org/reports/tr44/#Property_Values are more detailed.
    • The types I've used should be mostly self-explanatory. In brief:
      • CONTROL: don't print the character as-is; print a representation of its code point.
      • PRINT: print the character.
      • COMBINE: a combining character; print in combination with another character (I typically use U+25CC for this, e.g. &#x25cc;&#x300; renders as ◌̀).
      • UNKNOWN: traps anything I've failed to match.

    All tests pass. Here's an extract of the output:

        1..23
        ok 1 - Checking '0000' is of type 'CONTROL' (Got: 'CONTROL').
        ...
        ok 23 - Checking '10FFFF' is of type 'CONTROL' (Got: 'CONTROL').
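    For readers on an older Perl, a minimal sketch of the same classification without /xx, postfix dereference, or the enum module might look like the following. The character classes are written without embedded spaces, and the numeric constants from use constant are an assumption of this sketch rather than part of the code above. It also assumes \p{VS} is available in the Perl at hand.

```perl
use strict;
use warnings;

# Plain constants standing in for the enum module (values are arbitrary).
use constant { CONTROL => 0, PRINT => 1, COMBINE => 2, UNKNOWN => 3 };

# Takes a code point as a hex string and returns one of the constants.
sub type_of {
    my ($hex) = @_;
    my $char = chr hex $hex;

    # Same property tests as above, in the same order.
    return CONTROL if $char =~ /[\p{C}\p{Zl}\p{Zp}\p{VS}]/;
    return PRINT   if $char =~ /[\p{L}\p{N}\p{P}\p{S}\p{Zs}]/;
    return COMBINE if $char =~ /\p{M}/;
    return UNKNOWN;
}

my @names = qw(CONTROL PRINT COMBINE UNKNOWN);
for my $hex (qw(0009 0020 0041 0300)) {
    printf "U+%s is %s\n", uc $hex, $names[ type_of($hex) ];
}
```

    The order of the tests matters: a variation selector such as U+FE00 is also a combining mark, so it has to be caught by the first test to be reported as CONTROL.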

    See also Unicode::UCD. I found this useful for checking property values of individual characters.
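    As a rough illustration of that kind of check, the sketch below uses charinfo() from Unicode::UCD to print the general category and name of a few code points. The code points chosen are only examples.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# charinfo() returns a hash reference describing one code point.
for my $cp (0x0041, 0x0300, 0x200D) {
    my $info = charinfo($cp);
    printf "U+%04X  %-2s  %s\n", $cp, $info->{category}, $info->{name};
}
```

    The category field holds the two-letter general category (for example Lu or Mn), whose first letter corresponds to the one-letter \p{} classes used in type_of.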

    — Ken

      Hello kcott,

      Sorry for the late reply, I only just saw your post. This looks like a great idea (although it is a work in progress, as you said).

      The only problem is that I do not have such a recent version of Perl on the test bed. It runs v5.22.1, and I think some nodes are running even lower versions :P due to old OS releases.

      Nevertheless, thanks again for your time and effort reading and replying to my question.
