Regular Expressions \b and \B

vbrtrmn has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regular Expressions by tachyon (Chancellor) on Jun 27, 2001 at 08:08 UTC
How about this comment :-)

`s//#Just#Another#Perl#Hacker#/; s/\b\W/ /g; print;`

You can even help with speech recognition for stutter-challenged hackers:

`s//J-J-J-J>Just Another Perl Hacker/; s/\b\w\b//g; print;`

Personally I don't use \b much. One of the big problems I find with regexes is accidentally matching things you did not mean to, so I try to be as specific as possible. Rather than specify a boundary, I specify exactly what I want to follow. If you don't want to eat up your string you can use lookaround assertions, which are zero-width as well, so the zero-width nature of \b has no advantage on that front. These expressions are similar:

```
# $_ is assigned directly here: inside one script, an empty pattern in s//
# would reuse the last successfully matched pattern instead of loading the string

# using boundaries
$_ = '#foo foo foobar foo#'; s/\bfoo\b//g; print;

# using negative lookahead and lookbehind assertions
$_ = '#foo foo foobar foo#'; s/(?<!\w)foo(?!\w)//g; print;

# using positive lookahead and lookbehind assertions with char classes
$_ = '#foo foo foobar foo#'; s/(?<=[^\w])foo(?=[^\w])//g; print;
```

The difference is that with the lookaround assertions I have much greater control, as I can use a character class in them as shown.

TIMTOWTDI

cheers tachyon
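To make the point about character classes in lookarounds concrete, here is a minimal sketch. It assumes we want a hyphen to count as part of a word, something \b cannot express. The helper name `strip_word` and the sample string are invented for this illustration and are not from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: delete standalone occurrences of $word from $text,
# where "standalone" means not touching a word character OR a hyphen.
sub strip_word {
    my ($text, $word) = @_;
    $text =~ s/(?<![\w-])\Q$word\E(?![\w-])//g;
    return $text;
}

# For comparison: plain \b treats the hyphen as a non-word character,
# so the "foo" inside "foo-bar" is removed as well.
(my $with_b = 'foo foo-bar foo') =~ s/\bfoo\b//g;

print "lookaround: '", strip_word('foo foo-bar foo', 'foo'), "'\n";
print "slash_b:    '$with_b'\n";
```

The lookaround version leaves `foo-bar` intact, while the \b version cuts it down to `-bar`.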
Re: Re: Regular Expressions by Hofmator (Curate) on Jun 27, 2001 at 13:04 UTC
A good description, tachyon++ - you are correct that the lookarounds offer more flexibility, but let me point out the differences in terms of benchmarks:

```
#!/usr/bin/perl
use Benchmark qw/cmpthese/;

# swap the two lines below to benchmark the case where nothing is replaced
$defaulttext = q/foo / x 30;
# $defaulttext = q/foobar / x 30;

# each code string copies the text first, so every run starts from the same input
cmpthese( 100_000, {
    slash_b  => q{$text=$defaulttext; $text =~ s/\bfoo\b//g;},
    neg_look => q{$text=$defaulttext; $text =~ s/(?<!\w)foo(?!\w)//g;},
    pos_look => q{$text=$defaulttext; $text =~ s/(?<=[^\w])foo(?=[^\w])//g;},
});
```

With $defaulttext being 'foo foo ...' all three methods take approximately the same time, because modifying $text dominates the run time. With $defaulttext being 'foobar foobar ...' - i.e. no replacements are done - I get the following results:

```
            Rate pos_look neg_look slash_b
pos_look 27894/s       --      -8%    -35%
neg_look 30441/s       9%       --    -29%
slash_b  42662/s      53%      40%      --
```

This shows that the \b variant is roughly 40% to 50% quicker than the lookarounds, and that the negative lookaround is somewhat faster than the negated character class. But the most important difference can be seen from the following code:

```
$text = q/foo bar foo/;
($tmp = $text) =~ s/\bfoo\b//g;                 print $tmp,"\n";
($tmp = $text) =~ s/(?<!\w)foo(?!\w)//g;        print $tmp,"\n";
($tmp = $text) =~ s/(?<=[^\w])foo(?=[^\w])//g;  print $tmp,"\n";

# which prints (quotes added to show the spaces):
# ' bar '
# ' bar '
# 'foo bar foo'
```

The positive lookaround does not behave like the others at the boundaries of the string. This is because the positive lookaround requires a character (from the class) to be present, but - as there is no character before the beginning of the string or after the end - it fails. The negative lookaround succeeds even if no character is there.

-- Hofmator
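If the positive lookarounds are wanted anyway, one way to close the gap at the string edges is to offer \A and \z as alternatives next to them. The following is only a sketch of that idea, not something proposed in the thread; the sub names are hypothetical, and the alternation is kept outside the lookbehind because older Perls require fixed-width lookbehind.

```perl
use strict;
use warnings;

# Hypothetical helper: remove standalone "foo" using positive lookarounds,
# with \A and \z alternatives so the string edges count as boundaries too.
sub strip_foo_edge_safe {
    my ($text) = @_;
    $text =~ s/(?:\A|(?<=\W))foo(?:\z|(?=\W))//g;
    return $text;
}

# Reference implementation with \b, for comparison.
sub strip_foo_b {
    my ($text) = @_;
    $text =~ s/\bfoo\b//g;
    return $text;
}

for my $sample ('foo bar foo', 'foobar foo', 'foo') {
    printf "'%s' => '%s' (edge-safe) / '%s' (slash_b)\n",
        $sample, strip_foo_edge_safe($sample), strip_foo_b($sample);
}
```

On these samples both subs return the same strings, so the edge-safe pattern matches what \b does at the start and end of the string.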
Re: Re: Re: Regular Expressions by tachyon (Chancellor) on Jun 27, 2001 at 16:03 UTC
Good points. You might have noticed that I carefully used the word 'similar' rather than 'same'. As you point out, there are differences both in speed and in what matches where. My grasp of regexes continues to grow, thanks in large part to posts like these ++

cheers tachyon
Re: Re: Regular Expressions by vbrtrmn (Pilgrim) on Jun 27, 2001 at 08:33 UTC
Okay, I get that. How about `\B`? -- paul
Re: Re: Re: Regular Expressions by nysus (Parson) on Jun 27, 2001 at 08:46 UTC
If \b matches an invisible area between a word and a non-word character, \B matches the invisible spaces that are between two word characters OR between two non-word characters. For instance, /f\B./ matches "fr" but it does not match "f@", because the invisible space between 'f' and '@' is a word boundary and we are looking for invisible spaces that are not word boundaries. Similarly, /@\B./ would match "@}" but it would not match "@A".

$PM = "Perl Monk's"; $MCF = "Most Clueless ~~Friar~~ Abbot"; $nysus = $PM . $MCF;
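The four cases from this explanation can be checked directly. This is a small sketch; the helper `hits` and the table of cases are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $pattern matches anywhere in $string, else 0.
sub hits {
    my ($pattern, $string) = @_;
    return $string =~ $pattern ? 1 : 0;
}

# The @ is written as \@ so Perl cannot mistake it for an array to interpolate.
my @cases = (
    [ 'f\B.', qr/f\B./,  'fr' ],   # word + word: not a boundary, \B succeeds
    [ 'f\B.', qr/f\B./,  'f@' ],   # word + non-word: a boundary, \B fails
    [ '@\B.', qr/\@\B./, '@}' ],   # non-word + non-word: \B succeeds
    [ '@\B.', qr/\@\B./, '@A' ],   # non-word + word: a boundary, \B fails
);

for my $case (@cases) {
    my ($label, $re, $str) = @$case;
    printf "/%s/ on '%s': %s\n", $label, $str, hits($re, $str) ? 'match' : 'no match';
}
```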
Re: Regular Expressions by nysus (Parson) on Jun 27, 2001 at 08:37 UTC
Well, first you have to know that a RE evaluates not on a character but between characters. Take the following RE: /hello/

What actually happens is that the RE engine starts not directly at the 'h' but at the invisible area just before it. Then, when it finds an 'h' in your string, the RE engine jumps between the 'h' and the 'e' and starts looking for an 'e'. So by now you should be imagining your string as a sequence of characters with invisible spaces in between them, where the RE engine sits while looking for the next character.

When the RE engine encounters the \b metacharacter, it tells the engine to look not only at the character after the invisible space but also at the character before it. The engine then compares the two characters to each other and asks a question: "Is the character to the left of me a non-word character (that is, anything except a number, a letter, or an underscore), and is the character to the right of me a word character?" If this is the case, then the RE will find a match, because it recognizes this invisible space as a "word boundary": it is between a non-word character and a word character. Note that this is also true at the end of a word, except that there it has a word character to the left and a non-word character to the right.

Hope this helps. Let me know if you have specific questions.

$PM = "Perl Monk's"; $MCF = "Most Clueless ~~Friar~~ Abbot"; $nysus = $PM . $MCF;
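One way to see these invisible spaces is to ask Perl for every offset at which \b matches. The sketch below does that; the helper name `boundary_offsets` and the sample string are assumptions made for the illustration. Note that the very start and end of the string behave as if a non-word character sat just outside them, so they count as boundaries when a word character is adjacent.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offsets (gaps between characters) at which
# \b matches in $string. Offset 0 is the gap before the first character;
# length($string) is the gap after the last one.
sub boundary_offsets {
    my ($string) = @_;
    my @offsets;
    # \b is zero-width; with /g Perl moves on after each empty match,
    # so this loop visits every boundary exactly once.
    while ($string =~ /\b/g) {
        push @offsets, pos($string);
    }
    return @offsets;
}

my $s = 'hello, world';
print join(',', boundary_offsets($s)), "\n";
```

For 'hello, world' this reports offsets 0, 5, 7 and 12: before 'h', after 'o', before 'w' and after 'd'. The gap between the comma and the space (offset 6) is not reported, because both neighbours are non-word characters.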
Re: Regular Expressions by clemburg (Curate) on Jun 27, 2001 at 12:41 UTC
"I am starting to study up on Regular Expressions."

Then, by all means, get a copy of the Hip Owls book and read it. Fun fact: once upon a time, I calculated the time savings for some projects that came from applying regexes (which I learned about from this book - my first O'Reilly and Perl book ...). It turned out that this book is literally worth its weight in gold! Once you really get regexes, they will transform the way you think about data. And they are everywhere - in your editor, on the command line, even Word has some (admittedly funny ones).

Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com
Re: Re: Regular Expressions by andye (Curate) on Jun 27, 2001 at 15:02 UTC
"Then, by all means, get a copy of the Hip Owls book and read it."

Seconded! It's great. I'm particularly amused by the part where it says "Of course, knowing about such-and-such isn't necessary for a basic understanding of regular expressions - which is why we'll be going into it in great detail". (Well, it says something like that - the book's at home and I'm not).

andy.

/me was reading it the week before last on a Greek beach - wish I was still there!