Regex for Differentiating Underscore and Whitespace

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regex for Differentiating Underscore and Whitespace
by jasonk (Parson) on Nov 03, 2007 at 02:52 UTC

Your regexp is getting turned into //ms, which matches everything, because whitespace is removed from the regex when you use /x. Either remove the /x modifier, or use \s instead of literal whitespace.
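A minimal sketch of that effect, assuming the original test was a literal space under /x (the original script is not shown in this thread, and the helper names below are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helpers; the names are not from the original script.
sub space_under_x { my ($s) = @_; return $s =~ / /x   ? 1 : 0 }  # /x drops the space, leaving an empty pattern
sub space_plain   { my ($s) = @_; return $s =~ / /    ? 1 : 0 }  # literal space, no /x
sub space_class   { my ($s) = @_; return $s =~ /[ ]/x ? 1 : 0 }  # a bracketed space survives /x
sub space_escape  { my ($s) = @_; return $s =~ /\s/x  ? 1 : 0 }  # \s matches any whitespace

for my $s ( 'TIP UPSTREAM', 'TIP_UPSTREAM' ) {
    printf "%-14s /x:%d plain:%d [ ]:%d \\s:%d\n",
        $s, space_under_x($s), space_plain($s), space_class($s), space_escape($s);
}
```

Only the first helper should report a match for both strings; the other three distinguish the space from the underscore.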

We're not surrounded, we're in a target-rich environment!


Re: Regex for Differentiating Underscore and Whitespace
by grinder (Bishop) on Nov 03, 2007 at 11:42 UTC

Who needs a regexp?

if (index($str, '_') < 0) {  
    print "no underscore\n";  
}
else {
    print "with underscore\n";
}

• another intruder with the mooring in the heart of the Perl


Re: Regex for Differentiating Underscore and Whitespace
by moritz (Cardinal) on Nov 03, 2007 at 09:19 UTC

if ($str !~ m/_/)

if ($str =~ m/\A[^_]*\z/)


Re: Regex for Differentiating Underscore and Whitespace
by davido (Cardinal) on Nov 03, 2007 at 04:29 UTC

How about transliteration?

if( $str =~ tr/_/_/ ) {
    print "with underscore\n";
} else {
    print "no underscore\n";
}

Or just for fun...

my $result = 
      ( $str =~ tr/_/_/ )
    ? "with "
    : "no ";

print "$result underscore\n";

perlop explains the use of the tr/// operator.

Dave


Re: Regex for Differentiating Underscore and Whitespace
by CountZero (Bishop) on Nov 03, 2007 at 09:53 UTC

Although it already correctly identifies this string: $str = "TIP UPSTREAM"; as with underscore.

"TIP UPSTREAM"

x

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James


Re: Regex for Differentiating Underscore and Whitespace
by mwah (Hermit) on Nov 03, 2007 at 18:03 UTC

neversaint

what's wrong with my script above such that it prints no underscore instead of with underscore

moritz and CountZero have already offered corrections from several directions. Analyzing your code, jasonk pointed out your misconception about /x and whitespace, which means your code would work as intended if you dropped the regex modifiers:

 ...
 if ( $str =~ / / ) {  
    print "no underscore\n";  
 }
 else {
    print "with underscore\n";
 }
 ...

The /x modifier would cause the regex to ignore the space (as has been said), and /m and /s aren't needed here: they only affect ^, $ and ., none of which appear in the pattern.
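To make the remark about /m and /s concrete, here is a small sketch. The two-line input string is invented purely so that the flags have something to act on:

```perl
use strict;
use warnings;

# Hypothetical two-line input, chosen only to make the flags visible.
my $text = "TIP UPSTREAM\nTIP_DOWNSTREAM";

# A pattern with no ^, $ or . behaves the same with or without /m and /s.
my $plain = $text =~ / /   ? 1 : 0;
my $ms    = $text =~ / /ms ? 1 : 0;

# /m only changes ^ and $: with it, ^ may match at the start of line two.
my $anchor_without_m = $text =~ /^TIP_/  ? 1 : 0;
my $anchor_with_m    = $text =~ /^TIP_/m ? 1 : 0;

# /s only changes the dot: with it, . may match the newline.
my $dot_without_s = $text =~ /UPSTREAM.TIP/  ? 1 : 0;
my $dot_with_s    = $text =~ /UPSTREAM.TIP/s ? 1 : 0;

print "space test: plain=$plain ms=$ms\n";
print "anchor:     without /m=$anchor_without_m with /m=$anchor_with_m\n";
print "dot:        without /s=$dot_without_s with /s=$dot_with_s\n";
```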

In another response, grinder scrutinized your problem solution and offered a more efficient solution based on the index() function without any regular expressions.

In addition to these hints, davido tackles the problem with an important feature of the tr/// (transliteration) operator: it counts occurrences of characters in a very efficient way. This would reduce your problem to the following expression:

  ...
  my $str = $ARGV[0] || '|78187980|ref|NM_0';          # original string

  my $cnt = $str =~ tr/_//;                            # count the number of underscores

  print 'with ' . ($cnt || 'no') . " underscore(s)\n"; # print result depending on count
  ...

after which you may decide on the 'count' of the character in question.
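As a rough sketch of deciding on the count, the snippet below wraps the tr/// idiom in two helpers. The helper names and sample strings are illustrative only. With an empty replacement list and no /d flag, tr/// just counts and leaves the string unchanged:

```perl
use strict;
use warnings;

# Hypothetical helper: count underscores in a copy of the argument.
sub count_underscores {
    my ($s) = @_;
    return ( $s =~ tr/_// );    # empty replacement, no /d: counts only
}

# Hypothetical helper: turn the count into a message.
sub describe {
    my ($s) = @_;
    my $n = count_underscores($s);
    return $n ? "with underscore ($n found)" : "no underscore";
}

# tr/_// does not alter the string it is applied to.
my $str   = 'a_b_c';
my $count = ( $str =~ tr/_// );

print "$_: ", describe($_), "\n" for '|78187980|ref|NM_0', 'TIP UPSTREAM', $str;
```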

Regards

mwa


Re^2: Regex for Differentiating Underscore and Whitespace

by blazar (Canon) on Nov 04, 2007 at 11:15 UTC

In another response, grinder scrutinized your problem solution and offered a more efficient solution based on the index() function without any regular expressions.

I personally believe that the claim about efficiency is not correct, since that kind of regex should get optimized to index anyway - and often regexen have a more immediately readable syntax. For a Perl programmer that is...

I hope that the following minimal benchmark can shed some light:

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw/cmpthese :hireswallclock/;

my @a = do {
    my @chr=(grep /\w/, map chr, 1..255);
    map { 
        local $_ = join '', map $chr[rand @chr], 1..1000;
        tr/_/ / if .5<rand;
        $_;
    } 1..1000;
};

cmpthese 5000 => { Regex => sub () { grep !/_/, @a },
                   Index => sub () { grep index($_, '_') < 0, @a } };

__END__

I get e.g.

C:\temp>perl index.pl
       Rate Index Regex
Index 891/s    --   -0%
Regex 891/s    0%    --

and

blazar@perlmonk ~ $ perl index.pl
       Rate Index Regex
Index 261/s    --   -0%
Regex 262/s    0%    --

on two different systems.

Now, is this test flawed? I easily tend to get these kinda things wrong, I must admit...
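One cheap check that rules out at least one kind of flaw is to confirm that both candidates select exactly the same strings before timing them. A sketch, assuming a smaller sample built the same way as in the benchmark (the sizes 200 and 100 are arbitrary):

```perl
use strict;
use warnings;

# Sample data built like the benchmark's, only smaller.
my @chr    = grep /\w/, map chr, 1 .. 255;
my @sample = map {
    my $s = join '', map $chr[ rand @chr ], 1 .. 200;
    $s =~ tr/_/ / if rand() < 0.5;    # strip underscores from about half the strings
    $s;
} 1 .. 100;

# The two candidates from the benchmark, run once each.
my @by_regex = grep { !/_/ } @sample;
my @by_index = grep { index( $_, '_' ) < 0 } @sample;

# Same strings, in the same order? "\0" cannot occur in the \w-only data.
my $agree = join( "\0", @by_regex ) eq join( "\0", @by_index ) ? 1 : 0;
print $agree ? "candidates agree\n" : "candidates differ\n";
```

This says nothing about the timing itself, only that the two subs are doing equivalent work.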


Re^3: Regex for Differentiating Underscore and Whitespace

by mwah (Hermit) on Nov 04, 2007 at 14:09 UTC

blazar

Now, is this test flawed?

You are basically correct. I was too zealous in advertising the advantages of index() and tr//. They have their uses elsewhere, but not in this special case. Thanks for pointing this out.

I abused your benchmark code (of course) to find out how good the index() optimization in Perl5 really is ;-)

...
use Benchmark qw/cmpthese :hireswallclock/;

my @a = map { my $s='PM is cool, ' x 10_000;
              substr($s, rand(length $s), 1, '_');
              $s
            } 1..1000;

cmpthese -3 => {
                C_Idx => sub () { grep C_Idx($_, '_') < 0, @a },
                Index => sub () { grep index($_, '_') < 0, @a },
                Regex => sub () { grep ! /_/,    @a },
                Tr    => sub () { grep ! tr/_//, @a }
               };

use Inline C => qq[
 int C_Idx(SV* src, SV* chr)
{
 STRLEN srclen, chrlen;
 char *ssrc = SvPV(src, srclen), *schr = SvPV(chr, chrlen);
 char *p = ssrc;
 if( chrlen != 1 ) croak("single characters only for now!");
 return (p=memchr(p, *schr, srclen)) != NULL ? p-ssrc : -1;
}
];
...

On my system, somewhere above string lengths of 60-70K, index() falls behind the C library function for finding a character (memchr). For the above strings:

        Rate    Tr Regex Index C_Idx
Tr    3.17/s    --  -74%  -74%  -87%
Regex 12.2/s  284%    --   -0%  -52%
Index 12.2/s  285%    0%    --  -52%
C_Idx 25.2/s  696%  107%  107%    --

I personally believe it'd be much better if I read my own posts and thought about their assumptions much more thoroughly next time ;-)

Regards

mwa
