
Also, the character class needs to be negated so that it searches for characters outside the range, and then the result needs to be negated as well (here with !~ instead of =~). Otherwise the test only detects whether the string has any ASCII chars.

$string !~ m/[^\x00-\x7f]/

Re^3: How to determine if string contains non-ASCII characters ?
by ikegami (Patriarch) on Aug 19, 2009 at 17:22 UTC

    Your solution checks if any characters aren't ASCII.

    my $is_ascii = $string !~ /[^\x00-\x7F]/;

    An alternative is to check that all characters are ASCII.

    my $is_ascii = $string =~ /^[\x00-\x7F]*\z/;

    Both are fine. The first can be faster if the result is false, but the difference is probably inconsequential in practice.
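A minimal sketch comparing the two checks side by side. The sub names `is_ascii_negated` and `is_ascii_anchored` and the sample strings are made up for illustration; only the two regexes come from the post above.

```perl
use strict;
use warnings;

# Approach 1: ASCII if no character falls outside 0x00-0x7F.
sub is_ascii_negated {
    my ($string) = @_;
    return $string !~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Approach 2: ASCII if every character falls inside 0x00-0x7F.
sub is_ascii_anchored {
    my ($string) = @_;
    return $string =~ /^[\x00-\x7F]*\z/ ? 1 : 0;
}

# Illustrative samples: plain ASCII, Latin-1 e-acute, Arabic letters, empty string.
my @samples = (
    "plain text",
    "caf\x{E9}",
    "\x{645}\x{631}\x{62D}\x{628}\x{627}",
    "",
);

for my $sample (@samples) {
    printf "negated=%d anchored=%d\n",
        is_ascii_negated($sample), is_ascii_anchored($sample);
}
```

Both subs should agree on every input, including the empty string, which both treat as ASCII.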

Re^3: How to determine if string contains non-ASCII characters ?
by roadrunner (Acolyte) on Aug 19, 2009 at 17:25 UTC
    Thanks for the hints.

    However, both regexes still don't appear to be hitting the spot. I've created a small test program which pulls an Arabic title from a webpage to demonstrate:

    use LWP::UserAgent;
    $ua = LWP::UserAgent->new;
    my $resp = $ua->get("http://www.englishlink.com/index_ARE_HTML.asp");
    if ($resp->is_success) {
        $mystring = $resp->content;
        $mystring =~ s/.*\<title\>//sgi;
        $mystring =~ s/\<.*//sgi;
    }
    print "$mystring\n";
    if ($mystring =~ m/[^\x00-\x7f]/) {
        print "Contains ASCII only\n";
    }
    else {
        print "Contains non-ASCII\n";
    }

    When run, I'd expect to see the result as "Contains non-ASCII", but instead I get "Contains ASCII only".

    Any thoughts as to why ?

      You're assigning meaning to the bytes without checking which encoding was used. In fact, not only do you not handle the character encoding, you don't handle transfer encoding either! Using ->content() is practically always a bug. One should use ->decoded_content() or ->decoded_content( charset => 'none' ) instead.
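A rough sketch of the difference between ->content() and ->decoded_content(). To avoid a network request it builds an HTTP::Response by hand; the Arabic title and the UTF-8 charset header are assumptions for illustration, not what the real page serves.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Hypothetical title (assumption): five Arabic letters, sent as UTF-8 bytes.
my $title = "\x{645}\x{631}\x{62D}\x{628}\x{627}";
my $bytes = encode('UTF-8', "<title>$title</title>");

# Hand-built response standing in for what $ua->get(...) would return.
my $resp = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $bytes,
);

my $raw     = $resp->content;          # undecoded bytes as received
my $decoded = $resp->decoded_content;  # characters, decoded per the charset

printf "raw length:     %d\n", length $raw;
printf "decoded length: %d\n", length $decoded;

# The ASCII check is only meaningful on the decoded characters.
print "decoded text contains non-ASCII\n" if $decoded =~ /[^\x00-\x7F]/;
```

With UTF-8 the raw bytes would also trip the regex, since every non-ASCII character is encoded as bytes above 0x7F. That does not hold for every encoding, which is one reason to decode first.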
      That's because you reversed the conditions - if it contains bytes from the range > 127, it's not ASCII.
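A minimal sketch of the test with the branches the right way around. The sub name `describe` and the sample strings are illustrative only; a match on the negated class means the string does contain something outside ASCII.

```perl
use strict;
use warnings;

# Returns the message that matches what the string actually contains.
sub describe {
    my ($string) = @_;

    # A match here means at least one character is outside 0x00-0x7F.
    if ($string =~ /[^\x00-\x7F]/) {
        return "Contains non-ASCII";
    }
    return "Contains ASCII only";
}

print describe("English Link"), "\n";
print describe("\x{645}\x{631}\x{62D}\x{628}\x{627}"), "\n";
```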

      Update: I'd strongly recommend using at least perl-5.8.2 for any text processing that involves non-ASCII characters. It's a real pain with 5.6.*.

      ... but it seems that plugging ikegami's regex into the test program works!

      This is without doubt the most helpful forum around. Big thanks to all of you.