in reply to How to determine if string contains non-ASCII characters ?

Note that your regex matches the empty string, and thus any string at all.

Also ASCII covers only the bytes from 0 to 127, so your regex should read

$string =~ m/[\x00-\x7f]/
Perl 6 projects - links to (nearly) everything that is Perl 6.

Re^2: How to determine if string contains non-ASCII characters ?
by jethro (Monsignor) on Aug 19, 2009 at 17:08 UTC

    Also, the character class needs to be negated to search for characters outside the range, and then the result needs to be negated as well (here with !~ instead of =~). Otherwise he will only detect whether the string contains any ASCII characters at all.

    $string !~ m/[^\x00-\x7f]/

      Your solution checks if any characters aren't ASCII.

      my $is_ascii = $string !~ /[^\x00-\x7F]/;

      An alternative is to check that all characters are ASCII.

      my $is_ascii = $string =~ /^[\x00-\x7F]*\z/;

      Both are fine. The first can be faster if the result is false, but the difference is probably inconsequential in practice.
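To make the equivalence concrete, here is a minimal sketch that runs both regexes over a few sample strings. The helper names `is_ascii_negated` and `is_ascii_anchored` and the sample strings are made up for illustration; only the two regexes come from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper names, for illustration only.

# True unless some character falls outside 0x00-0x7F.
sub is_ascii_negated {
    my ($s) = @_;
    return $s !~ /[^\x00-\x7F]/ ? 1 : 0;
}

# True if every character, from start to end, is within 0x00-0x7F.
sub is_ascii_anchored {
    my ($s) = @_;
    return $s =~ /^[\x00-\x7F]*\z/ ? 1 : 0;
}

# Sample strings: empty, plain ASCII, Latin-1 e-acute, Arabic letters.
my @samples = ('', 'plain text', "caf\x{e9}", "\x{645}\x{631}\x{62d}\x{628}\x{627}");

for my $s (@samples) {
    printf "%d %d\n", is_ascii_negated($s), is_ascii_anchored($s);
}
```

Both columns should agree for every input, including the empty string, which both versions treat as ASCII.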

      Thanks for the hints.

      However, both regexes still don't appear to be hitting the spot. I've created a small test program which pulls an Arabic title from a webpage to demonstrate:

      use LWP::UserAgent;

      $ua = LWP::UserAgent->new;
      my $resp = $ua->get("http://www.englishlink.com/index_ARE_HTML.asp");
      if ($resp->is_success) {
          $mystring = $resp->content;
          $mystring =~ s/.*\<title\>//sgi;
          $mystring =~ s/\<.*//sgi;
      }
      print "$mystring\n";
      if ($mystring =~ m/[^\x00-\x7f]/) {
          print "Contains ASCII only\n";
      } else {
          print "Contains non-ASCII\n";
      }

      When run, I'd expect to see the result as "Contains non-ASCII", but instead I get "Contains ASCII only".

      Any thoughts as to why ?

        You're assigning meaning to the bytes without checking which encoding was used. In fact, not only do you not handle the character encoding, you don't handle the transfer encoding either! Using ->content() is practically always a bug. One should use ->decoded_content() or ->decoded_content( charset => 'none' ) instead.

        That's because you reversed the conditions: if the regex matches, the string contains a byte above 127, so it's not ASCII.
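A minimal sketch of both points, with the branches the right way round. It assumes the title arrives as UTF-8 bytes and uses the core Encode module as a stand-in for what ->decoded_content() would do with a real response. The sample bytes and the `describe` helper are hypothetical, not taken from the page in the test program.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample data: the UTF-8 bytes of a five-letter Arabic word.
my $bytes = "\xd9\x85\xd8\xb1\xd8\xad\xd8\xa8\xd8\xa7";

# Turn the raw bytes into Perl characters (assumes the source is UTF-8).
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper: a match on the negated class means non-ASCII was found.
sub describe {
    my ($s) = @_;
    return $s =~ /[^\x00-\x7F]/ ? "Contains non-ASCII" : "Contains ASCII only";
}

print length($bytes), " bytes, ", length($chars), " characters\n";
print describe($chars), "\n";
print describe("Welcome"), "\n";
```

Note that the ASCII test gives the same answer on the raw UTF-8 bytes, since every byte of a multi-byte UTF-8 sequence is above 127. Decoding matters as soon as you count, extract, or print characters.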

        Update: I'd strongly recommend using at least perl-5.8.2 for any text processing that involves non-ASCII characters. It's a real pain with 5.6.*.

        ... but it seems that plugging ikegami's regex into the test program works!

        This is without doubt the most helpful forum around. Big thanks to all of you.