roadrunner has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone know of a regex that can detect any occurrence of a non-ASCII character in a given string?

My program is importing strings that could contain some non-ASCII characters in them and I'd like to treat those types of string differently.

I've tried:
$mystring =~ m/([\x00-\xff]*)/gi
... which should evaluate to true if the string is ASCII only, and false if there are any non-ASCII characters in it. However, a simple Arabic string still evaluates to true.

Note: I'm using ActivePerl 5.6.1 on Win32 so I can't use the Encode module, which is why I'd like to use regex instead.

Thanks.

Re: How to determine if string contains non-ASCII characters ?
by moritz (Cardinal) on Aug 19, 2009 at 16:57 UTC
    Note that your regex matches the empty string, and thus any string at all.
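To illustrate the point about the empty match, here is a minimal sketch. The helper name `original_check` and the sample strings are made up for the demonstration; only the pattern comes from the original post. Because `*` allows zero repetitions, the match succeeds even when the very first character is outside the class.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern from the original question.
sub original_check {
    my ($s) = @_;
    # The * quantifier permits zero repetitions, so this always succeeds,
    # at worst by matching the empty string at position 0.
    return $s =~ m/([\x00-\xff]*)/ ? 1 : 0;
}

# Sample inputs (assumptions): plain ASCII, the empty string, and a string
# that starts with U+0639 (an Arabic letter), which is outside \x00-\xff.
for my $sample ("plain ascii", "", "\x{0639}abc") {
    print "matched=", original_check($sample), "\n";
}
```

All three samples report a match, so the pattern cannot tell the strings apart.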

    Also ASCII covers only the bytes from 0 to 127, so your regex should read

    $string =~ m/[\x00-\x7f]/

      The character class also needs to be negated, so that it searches for characters outside the range, and then the result needs to be negated (here with !~ instead of =~). Otherwise it only detects whether the string contains any ASCII characters at all.

      $string !~ m/[^\x00-\x7f]/

        Your solution checks if any characters aren't ASCII.

        my $is_ascii = $string !~ /[^\x00-\x7F]/;

        An alternative is to check that all characters are ASCII.

        my $is_ascii = $string =~ /^[\x00-\x7F]*\z/;

        Both are fine. The first can be faster if the result is false, but the difference is probably inconsequential in practice.
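A small side-by-side sketch of the two forms above may help. The subroutine names and sample strings are illustrative assumptions; the patterns are the ones quoted in this reply.

```perl
use strict;
use warnings;

# Negated form: true unless some character lies outside 0x00-0x7F.
sub is_ascii_negated {
    my ($string) = @_;
    return $string !~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Anchored form: true only if every character lies inside 0x00-0x7F.
sub is_ascii_anchored {
    my ($string) = @_;
    return $string =~ /^[\x00-\x7F]*\z/ ? 1 : 0;
}

# Sample inputs (assumptions): ASCII, empty, a Latin-1 byte, two Arabic letters.
for my $sample ("hello", "", "caf\xE9", "\x{0639}\x{0631}") {
    printf "%d %d\n", is_ascii_negated($sample), is_ascii_anchored($sample);
}
# Prints:
# 1 1
# 1 1
# 0 0
# 0 0
```

Both forms treat the empty string as ASCII, since it contains no character outside the range.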

        Thanks for the hints.

        However, both regexes still don't appear to be hitting the spot. I've created a small test program which pulls an Arabic title from a webpage to demonstrate:

        use LWP::UserAgent;

        $ua = LWP::UserAgent->new;
        my $resp = $ua->get("http://www.englishlink.com/index_ARE_HTML.asp");
        if ($resp->is_success) {
            $mystring = $resp->content;
            $mystring =~ s/.*\<title\>//sgi;
            $mystring =~ s/\<.*//sgi;
        }
        print "$mystring\n";
        if ($mystring =~ m/[^\x00-\x7f]/) {
            print "Contains ASCII only\n";
        }
        else {
            print "Contains non-ASCII\n";
        }
        When run, I'd expect to see the result "Contains non-ASCII", but instead I get "Contains ASCII only".

        Any thoughts as to why?
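One thing that stands out in the test program above: the pattern `/[^\x00-\x7f]/` matches when a non-ASCII character is present, so the two print branches look swapped relative to the =~ test. Below is a sketch with the branches the other way round. The `classify` subroutine and the hard-coded byte strings are assumptions standing in for the downloaded title, so the example runs without a network connection.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the message for a given string.
sub classify {
    my ($text) = @_;
    # A successful match means at least one character outside
    # 0x00-0x7F was found, i.e. the string is NOT pure ASCII.
    if ($text =~ m/[^\x00-\x7f]/) {
        return "Contains non-ASCII";
    }
    return "Contains ASCII only";
}

# Sample inputs (assumptions): an ASCII title and the UTF-8 bytes
# of two Arabic letters, as raw page content might deliver them.
print classify("English title"), "\n";
print classify("\xD8\xB9\xD8\xB1"), "\n";
```

Equivalently, the original branch order could be kept by testing with `!~` instead of `=~`.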