in reply to Regular Expressions \b and \B

How about this comment :-)

s//#Just#Another#Perl#Hacker#/; s/\b\W/ /g; print;

You can even help with speech recognition for stutter-challenged hackers

s//J-J-J-J>Just Another Perl Hacker/; s/\b\w\b//g; print;

Personally I don't use \b much. One of the big problems I find with regexes is accidentally matching things you did not mean to, so I try to be as specific as possible. Rather than specify a boundary, I specify exactly what I want to follow. If you don't want to eat up your string you can use the lookaround assertions, so the zero-width nature of \b has no advantage on that front. These expressions are similar:

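# $_ is assigned directly each time: after a successful match, an empty
# pattern in s//.../ would reuse the last successful regex instead

# using boundaries
$_ = '#foo foo foobar foo#';
s/\bfoo\b//g;
print "$_\n";

# using negative lookahead and lookbehind assertions
$_ = '#foo foo foobar foo#';
s/(?<!\w)foo(?!\w)//g;
print "$_\n";

# using positive lookahead and lookbehind assertions with char classes
$_ = '#foo foo foobar foo#';
s/(?<=[^\w])foo(?=[^\w])//g;
print "$_\n";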

The difference is that with the lookaround assertions I have much greater control as I can use a character class in them as shown.
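
As a rough sketch of that extra control, assume you want a hyphen to count as part of a word, which \b cannot express on its own. The helper names strip_with_b and strip_with_class and the sample string are made up for illustration only:

```perl
use strict;
use warnings;

# Hypothetical helper: delete standalone "foo" using plain \b.
sub strip_with_b {
    my ($s) = @_;
    $s =~ s/\bfoo\b//g;
    return $s;
}

# Hypothetical helper: same idea, but the character class in the
# lookarounds also treats "-" as a word character, so "foo-bar" is kept.
sub strip_with_class {
    my ($s) = @_;
    $s =~ s/(?<![\w-])foo(?![\w-])//g;
    return $s;
}

my $text = '#foo foo-bar foo#';
print strip_with_b($text), "\n";      # "foo" is also cut out of "foo-bar"
print strip_with_class($text), "\n";  # "foo-bar" survives intact
```

With \b the "foo" in "foo-bar" is removed because "-" is a non-word character; with the widened class it is left alone.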

TIMTOWTDI

cheers

tachyon

Re: Re: Regular Expressions
by Hofmator (Curate) on Jun 27, 2001 at 13:04 UTC

    A good description, tachyon++ - you are correct that the lookarounds offer more flexibility but let me point out the differences in terms of benchmarks:

    #!/usr/bin/perl
    use Benchmark qw/cmpthese/;

    $defaulttext = q/foo / x 30;
    # $defaulttext = q/foobar / x 30;

    cmpthese( 100_000, {
        slash_b  => q{$text=$defaulttext; $text =~ s/\bfoo\b//g;},
        neg_look => q{$text=$defaulttext; $text =~ s/(?<!\w)foo(?!\w)//g;},
        pos_look => q{$text=$defaulttext; $text =~ s/(?<=[^\w])foo(?=[^\w])//g;},
    });

    With $defaulttext being 'foo foo ...', all three methods take approximately the same time, because modifying $text dominates the run time.

    With $defaulttext being 'foobar foobar ...' - i.e. no replacements are done - I get the following results:

                  Rate pos_look neg_look  slash_b
    pos_look 27894/s       --      -8%     -35%
    neg_look 30441/s       9%       --     -29%
    slash_b  42662/s      53%      40%       --

    This shows that the \b variant is roughly 40-50% quicker than the lookaround variants, and that the negative lookaround is about 9% faster than the positive lookaround with a negated character class.

    But the most important difference can be seen from the following code:

    $text = q/foo bar foo/;

    ($tmp = $text) =~ s/\bfoo\b//g;
    print $tmp,"\n";

    ($tmp = $text) =~ s/(?<!\w)foo(?!\w)//g;
    print $tmp,"\n";

    ($tmp = $text) =~ s/(?<=[^\w])foo(?=[^\w])//g;
    print $tmp,"\n";

    # which prints:
    #  bar
    #  bar
    # foo bar foo

    The positive lookaround does not behave like the others at the boundaries of the string. This is because the positive lookaround looks for a character (class) but - as there is no character before the beginning of the string or after the end - it fails. The negative lookaround works even if no character is there.
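
    If you want to keep the positive form, one possible workaround (a sketch only, not something benchmarked here) is to accept the start or end of the string as an alternative to the required character. The helper names strip_pos and strip_pos_anchored are invented for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: the positive lookarounds exactly as used above.
sub strip_pos {
    my ($s) = @_;
    $s =~ s/(?<=[^\w])foo(?=[^\w])//g;
    return $s;
}

# Hypothetical helper: also accept the start (\A) or the end (\z) of the
# string where the positive lookarounds would demand a non-word character.
sub strip_pos_anchored {
    my ($s) = @_;
    $s =~ s/(?:\A|(?<=[^\w]))foo(?=[^\w]|\z)//g;
    return $s;
}

my $text = 'foo bar foo';
print '[', strip_pos($text), "]\n";           # nothing is removed
print '[', strip_pos_anchored($text), "]\n";  # both "foo"s are removed
```

    The \A alternative sits outside the lookbehind so that the lookbehind itself stays fixed-width, which older perls require.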

    -- Hofmator

      Good points. You might have noticed that I carefully used the word 'similar' rather than 'same'. As you point out, there are differences both in speed and in what matches where. My grasp of regexes continues to grow, thanks in large part to posts like these ++

      cheers

      tachyon

Re: Re: Regular Expressions
by vbrtrmn (Pilgrim) on Jun 27, 2001 at 08:33 UTC
    Okay, I get that, how about \B?
    --
    paul
      If \b matches an invisible area between a word and a non-word character, \B matches the invisible spaces that are between two word characters OR between two non-word characters.

      For instance, /f\B./ matches "fr" but it does not match "f@" because the invisible space between 'f' and '@' is a word boundary and we are looking for invisible spaces that are not word boundaries. Similarly, /@\B./ would match "@}" but it would not match "@A".
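
      A quick way to check those four cases yourself is a small table-driven test. This is just a sketch; the helper name matches is made up for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the pattern (given as a string)
# matches the text, 0 otherwise.
sub matches {
    my ($pat, $s) = @_;
    return $s =~ /$pat/ ? 1 : 0;
}

# Each case is [ pattern, text ]; single quotes keep \B and @ literal.
for my $case (
    [ 'f\B.', 'fr' ],   # two word characters: no boundary between them
    [ 'f\B.', 'f@' ],   # word then non-word: a boundary, so \B fails
    [ '@\B.', '@}' ],   # two non-word characters: no boundary
    [ '@\B.', '@A' ],   # non-word then word: a boundary, so \B fails
) {
    my ($pat, $s) = @$case;
    printf "/%s/ on %-4s %s\n", $pat, "\"$s\"",
        matches($pat, $s) ? 'matches' : 'does not match';
}
```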

      $PM = "Perl Monk's";
      $MCF = "Most Clueless Friar Abbot";
      $nysus = $PM . $MCF;