neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
I have the following script.
    use strict;
    use Data::Dumper;
    use Carp;

    my $str = $ARGV[0] || '|78187980|ref|NM_0';

    if ( $str =~ / /xms ) {
        print "no underscore\n";
    }
    else {
        print "with underscore\n";
    }

With the default input parameter, what's wrong with my script above such that it prints "no underscore" instead of "with underscore"?

Although it already correctly identifies this string:
    $str = "TIP UPSTREAM";
as "with underscore".

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: Regex for Differentiating Underscore and Whitespace
by jasonk (Parson) on Nov 03, 2007 at 02:52 UTC

    Your regexp is effectively being turned into //ms, which matches every string, because literal whitespace is removed from the pattern when you use /x. Either remove the /x modifier, or use \s instead of literal whitespace.
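A minimal sketch of the /x behaviour described above, run against the two strings from the question. The helper names are hypothetical and only serve the illustration. An escaped space and a space inside a bracketed class are shown as alternatives to \s; both survive /x.

```perl
use strict;
use warnings;

# Hypothetical helpers (names are illustrative, not from the thread).
# Each returns 1 if its pattern matches the given string, else 0.
sub match_space_x       { my ($s) = @_; return $s =~ / /x   ? 1 : 0 }  # space is stripped by /x
sub match_space_plain   { my ($s) = @_; return $s =~ / /    ? 1 : 0 }  # literal space, no /x
sub match_space_escaped { my ($s) = @_; return $s =~ /\ /x  ? 1 : 0 }  # escaped space survives /x
sub match_space_class   { my ($s) = @_; return $s =~ /[ ]/x ? 1 : 0 }  # bracketed space survives /x

for my $str ('|78187980|ref|NM_0', 'TIP UPSTREAM') {
    printf "%-20s x=%d plain=%d escaped=%d class=%d\n",
        $str,
        match_space_x($str),       match_space_plain($str),
        match_space_escaped($str), match_space_class($str);
}
```

Only the / /x variant should match the default string, which contains no space at all; the other three match only "TIP UPSTREAM".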


    We're not surrounded, we're in a target-rich environment!
Re: Regex for Differentiating Underscore and Whitespace
by grinder (Bishop) on Nov 03, 2007 at 11:42 UTC

    Who needs a regexp?

    if (index($str, '_') < 0) {
        print "no underscore\n";
    }
    else {
        print "with underscore\n";
    }
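A short sketch of why the comparison against 0 matters: index() returns the zero-based position of the first occurrence, or -1 when the substring is absent, so a match at position 0 is false in boolean context. The helper name has_underscore is an assumption made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $str contains at least one underscore.
sub has_underscore {
    my ($str) = @_;
    return index($str, '_') >= 0 ? 1 : 0;
}

print index('|78187980|ref|NM_0', '_'), "\n";   # 16
print index('_leading', '_'), "\n";             # 0: found, but false as a boolean
print index('TIP UPSTREAM', '_'), "\n";         # -1: not found

print has_underscore('_leading') ? "with underscore\n" : "no underscore\n";
```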

    • another intruder with the mooring in the heart of the Perl

Re: Regex for Differentiating Underscore and Whitespace
by moritz (Cardinal) on Nov 03, 2007 at 09:19 UTC
    If you want to test whether a string contains no underscore, use either if ($str !~ m/_/) or if ($str =~ m/\A[^_]*\z/).
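The two tests above can be compared side by side. This is only a sketch with hypothetical helper names; it checks that both forms agree on a few sample strings, including the empty string.

```perl
use strict;
use warnings;

# Hypothetical helpers: both return 1 when the string contains no underscore.
sub lacks_underscore_negated  { my ($s) = @_; return $s !~ m/_/          ? 1 : 0 }
sub lacks_underscore_anchored { my ($s) = @_; return $s =~ m/\A[^_]*\z/ ? 1 : 0 }

for my $str ('|78187980|ref|NM_0', 'TIP UPSTREAM', '', '_') {
    printf "%-20s negated=%d anchored=%d\n",
        "'$str'",
        lacks_underscore_negated($str),
        lacks_underscore_anchored($str);
}
```

The negated match is usually the easier one to read; the anchored form states the same condition positively, as "the whole string consists of non-underscore characters".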
Re: Regex for Differentiating Underscore and Whitespace
by davido (Cardinal) on Nov 03, 2007 at 04:29 UTC

    How about transliteration?

    if( $str =~ tr/_/_/ ) {
        print "with underscore\n";
    }
    else {
        print "no underscore\n";
    }

    Or just for fun...

    my $result = ( $str =~ tr/_/_/ ) ? "with" : "no";
    print "$result underscore\n";

    perlop explains the use of the tr/// operator.
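A minimal sketch of tr/// used purely as a counter. With an empty replacement list and no /d modifier, the string is left unchanged and the return value is the number of matching characters. The helper name count_underscores is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number of underscores in $str.
sub count_underscores {
    my ($str) = @_;              # operate on a copy of the argument
    return ( $str =~ tr/_// );   # empty replacement list: count only
}

my $s = 'a_b_c';
my $n = ( $s =~ tr/_// );
print "$n underscores, string is still '$s'\n";   # 2 underscores, string is still 'a_b_c'

print count_underscores('TIP UPSTREAM') ? "with underscore\n" : "no underscore\n";
```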


    Dave

Re: Regex for Differentiating Underscore and Whitespace
by CountZero (Bishop) on Nov 03, 2007 at 09:53 UTC
    Although it already correctly identifies this string: $str = "TIP UPSTREAM"; as "with underscore".
    There is no underscore in "TIP UPSTREAM", so your regex does not correctly identify your string. No wonder, since you are not checking for underscores but just for a space (and not even that, due to the use of the x modifier).

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Regex for Differentiating Underscore and Whitespace
by mwah (Hermit) on Nov 03, 2007 at 18:03 UTC
    neversaint
    what's wrong with my script above such that it prints no underscore instead of with underscore

    moritz and CountZero have already given corrective hints from several directions. From analyzing your code, jasonk pointed out your misconception about /x and whitespace, which means your code would work as intended if you dropped the regex modifiers:

    ...
    if ( $str =~ / / ) {
        print "no underscore\n";
    }
    else {
        print "with underscore\n";
    }
    ...

    The /x modifier makes the regex ignore the literal space (as has been said), and /m and /s aren't needed here: the pattern contains no ^, $ or ., so they don't change anything.

    In another response, grinder questioned whether a regex is needed at all and offered a more efficient solution based on the index() function, without any regular expressions.

    In addition to these hints, davido tackles the problem with an important feature of the tr/// (transliteration) operator: counting occurrences of characters in a very efficient way. This would reduce your problem to the following expression:

    ...
    my $str = $ARGV[0] || '|78187980|ref|NM_0';            # original string
    my $cnt = $str =~ tr/_//;                              # count the number of underscores
    print 'with ' . ($cnt || 'no') . " underscore(s)\n";   # print result depending on count
    ...

    after which you may decide on the 'count' of the character in question.

    Regards

    mwa

      In another response, grinder questioned whether a regex is needed at all and offered a more efficient solution based on the index() function, without any regular expressions.

      I personally believe that the claim about efficiency is not correct, since that kind of regex should get optimized to index anyway, and regexen often have a more immediately readable syntax. For a Perl programmer, that is...

      I hope that the following minimal benchmark can shed some light:

      #!/usr/bin/perl

      use strict;
      use warnings;
      use Benchmark qw/cmpthese :hireswallclock/;

      my @a = do {
          my @chr = (grep /\w/, map chr, 1..255);
          map {
              local $_ = join '', map $chr[rand @chr], 1..1000;
              tr/_/ / if .5 < rand;
              $_;
          } 1..1000;
      };

      cmpthese 5000 => {
          Regex => sub () { grep !/_/, @a },
          Index => sub () { grep index($_, '_') < 0, @a }
      };

      __END__

      I get e.g.

      C:\temp>perl index.pl
             Rate Index Regex
      Index 891/s    --   -0%
      Regex 891/s    0%    --

      and

      blazar@perlmonk ~ $ perl index.pl
             Rate Index Regex
      Index 261/s    --   -0%
      Regex 262/s    0%    --

      on two different systems.

      Now, is this test flawed? I easily tend to get these kinda things wrong, I must admit...

        blazar
        Now, is this test flawed?

        You are basically correct here. I was too zealous in advertising the advantages of index() and tr//. They have their uses elsewhere, but not in this special case. Thanks for pointing this out.

        I abused your benchmark code (of course) to find out on how good the index() optimization in Perl5 really is ;-)

        ...
        use Benchmark qw/cmpthese :hireswallclock/;

        my @a = map {
            my $s = 'PM is cool, ' x 10_000;
            substr($s, rand(length $s), 1, '_');
            $s
        } 1..1000;

        cmpthese -3 => {
            C_Idx => sub () { grep C_Idx($_, '_') < 0, @a },
            Index => sub () { grep index($_, '_') < 0, @a },
            Regex => sub () { grep ! /_/, @a },
            Tr    => sub () { grep ! tr/_//, @a }
        };

        use Inline C => qq[
            int C_Idx(SV* src, SV* chr)
            {
                STRLEN srclen, chrlen;
                char *ssrc = SvPV(src, srclen), *schr = SvPV(chr, chrlen);
                char *p = ssrc;
                if( chrlen != 1 ) croak("single characters only for now!");
                return (p=memchr(p, *schr, srclen)) != NULL ? p-ssrc : -1;
            }
        ];
        ...

        On my system, somewhere above a string length of 60-70K characters, index() falls behind the C library function for finding a character (memchr). For the above strings:

                Rate    Tr Regex Index C_Idx
        Tr    3.17/s    --  -74%  -74%  -87%
        Regex 12.2/s  284%    --   -0%  -52%
        Index 12.2/s  285%    0%    --  -52%
        C_Idx 25.2/s  696%  107%  107%    --

        I personally believe it'd be much better if I read my own posts and thought about their assumptions much more thoroughly next time ;-)

        Regards

        mwa