Premature optimization is the root of all evil (D. Knuth)
So, before we look at how to do this fast, let's look at how to do it correctly :-)
What we are looking for is a word, that is, a sequence of word characters between word boundaries, which is followed by one or more instances of itself. So let's build a RE for that:
    use strict;
    use warnings;

    my $test = "alpha beta beta gamma gamma gamma delta";

    # /(\b\w+\b)/ matches a single word :
    print "Word in string : $1\n" while $test =~ /(\b\w+\b)/g;

    # but we want such a word, followed by (whitespace and then) the word
    # again. The \b after \1 keeps "the theme" from counting as a repeat :
    print "Double word : $1\n" if $test =~ /(\b\w+\b)\s*\1\b/;

    # but we not only want to catch duplicates, we also want to catch
    # multiple repetitions :
    print "Double or more word : $1\n" if $test =~ /(\b\w+\b)(\s*\1\b)+/;

    # And since we're throwing it away anyway, there's no need to actually
    # capture stuff into $2 :
    print "Double or more word : $1\n" if $test =~ /(\b\w+\b)(?:\s*\1\b)+/;

    # Now, here's the whole RE to seek out repeating words
    # and collapse them into one word.
    $test =~ s/(\b\w+\b)(?:\s*\1\b)+/$1/g;
    print "$test\n";
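To check a few edge cases, the final substitution can be wrapped in a small sub and run against some sample strings. This is only a sketch: the name `collapse_repeats` and the sample inputs are made up for illustration, and the pattern carries a `\b` after the backreference so that a word which is merely a prefix of the next one is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): collapse each run of
# a repeated word into a single occurrence of that word.
sub collapse_repeats {
    my ($text) = @_;
    $text =~ s/(\b\w+\b)(?:\s*\1\b)+/$1/g;
    return $text;
}

for my $sample (
    "alpha beta beta gamma gamma gamma delta",   # plain repeats
    "the theme",                                 # prefix, not a repeat
    "Beta beta",                                 # differs only in case
) {
    printf "%-45s => '%s'\n", "'$sample'", collapse_repeats($sample);
}
```

The first string collapses to "alpha beta gamma delta"; the other two come back unchanged, because the trailing `\b` rejects "the" inside "theme" and the match is case-sensitive. Adding the `/i` modifier would be one way to treat "Beta beta" as a repeat, if that's what you want.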
I haven't given much thought to optimization yet. A regular expression substitution may well not be the fastest solution, and some plain string scanning might be faster, but since you don't give much context, I can't be of much help in that department.
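For comparison, here is one way the "string scanning" idea might look. It's a rough sketch under a few assumptions, not a measured improvement: the name `dedupe_by_scan` is hypothetical, words are taken to be whitespace-separated (so punctuation stays attached to its word), and runs of whitespace are normalised to single spaces, which makes it not quite a drop-in replacement for the substitution.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the whitespace-separated words once and keep
# a word only if it differs from the one kept just before it.
sub dedupe_by_scan {
    my ($text) = @_;
    my @kept;
    for my $word (split ' ', $text) {
        push @kept, $word unless @kept && $kept[-1] eq $word;
    }
    return join ' ', @kept;
}

print dedupe_by_scan("alpha beta beta gamma gamma gamma delta"), "\n";
```

Whether this actually beats the regex will depend on your data; the core Benchmark module (`cmpthese`) is one way to time both versions on real input before picking one.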
Update: Changed attribution from Kernighan to Knuth
In reply to Re: most efficient regex to delete duplicate words
by Corion
in thread most efficient regex to delete duplicate words
by Anonymous Monk