in reply to Re^2: How to determine if string contains non-ASCII characters ?
in thread How to determine if string contains non-ASCII characters ?

Thanks for the hints.

However, neither regex seems to hit the spot. I've created a small test program that pulls an Arabic title from a web page to demonstrate:

    use LWP::UserAgent;

    $ua = LWP::UserAgent->new;
    my $resp = $ua->get("http://www.englishlink.com/index_ARE_HTML.asp");
    if ($resp->is_success) {
        $mystring = $resp->content;
        $mystring =~ s/.*\<title\>//sgi;
        $mystring =~ s/\<.*//sgi;
    }
    print "$mystring\n";
    if ($mystring =~ m/[^\x00-\x7f]/) {
        print "Contains ASCII only\n";
    }
    else {
        print "Contains non-ASCII\n";
    }

When run, I'd expect to see the result "Contains non-ASCII", but instead I get "Contains ASCII only".

Any thoughts as to why?

Re^4: How to determine if string contains non-ASCII characters ?
by ikegami (Patriarch) on Aug 19, 2009 at 17:54 UTC
    You're assigning meaning to the bytes without checking which encoding was used. In fact, not only do you not handle the character encoding, you don't handle transfer encoding either! Using ->content() is practically always a bug. One should use ->decoded_content() or ->decoded_content( charset => 'none' ) instead.
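To illustrate the difference between ->content() and ->decoded_content(), here is a minimal sketch that needs no network access. It builds an HTTP::Response by hand; the two-letter Arabic title and the UTF-8 charset header are assumptions made for the demonstration, not what the site in the question actually serves.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Hypothetical stand-in for a fetched page: a title made of two
# Arabic letters (U+0627, U+0644), sent as UTF-8 encoded bytes.
my $html  = "<title>\x{0627}\x{0644}</title>";
my $bytes = encode('UTF-8', $html);

my $resp = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $bytes,
);

my $raw     = $resp->content;          # undecoded octets
my $decoded = $resp->decoded_content;  # characters, charset applied

my ($raw_title)  = $raw     =~ m{<title>(.*?)</title>}si;
my ($char_title) = $decoded =~ m{<title>(.*?)</title>}si;

printf "raw title length:     %d\n", length $raw_title;   # counts bytes
printf "decoded title length: %d\n", length $char_title;  # counts characters
```

In this sketch the raw title is four bytes long while the decoded title is two characters long, which is why any test that assigns meaning to individual code points should run on the decoded string.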
Re^4: How to determine if string contains non-ASCII characters ?
by moritz (Cardinal) on Aug 19, 2009 at 17:30 UTC
    That's because you reversed the conditions: if the regex matches, the string contains bytes above 127, so it's not ASCII.
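A minimal sketch of the check with the branches the right way round. The helper name has_non_ascii and the sample strings are hypothetical, chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string has any character (or byte)
# outside the 7-bit ASCII range 0x00-0x7F, otherwise 0.
sub has_non_ascii {
    my ($string) = @_;
    return $string =~ /[^\x00-\x7f]/ ? 1 : 0;
}

for my $sample ("plain title", "caf\x{e9}", "\x{0627}\x{0644}") {
    # A match means a non-ASCII character was found.
    if (has_non_ascii($sample)) {
        print "Contains non-ASCII\n";
    }
    else {
        print "Contains ASCII only\n";
    }
}
```

The negated character class matches as soon as one character falls outside 0x00-0x7F, so a successful match is the non-ASCII case.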

    Update: I'd strongly recommend using at least perl-5.8.2 for any text processing that involves non-ASCII characters. It's a real pain with 5.6.*.

    Perl 6 projects - links to (nearly) everything that is Perl 6.
Re^4: How to determine if string contains non-ASCII characters ?
by roadrunner (Acolyte) on Aug 19, 2009 at 17:38 UTC
    ... but it seems that plugging ikegami's regex into the test program works!

    This is without doubt the most helpful forum around. Big thanks to all of you.