Re: most efficient regex to delete duplicate words

Premature optimization is the root of all evil (D. Knuth)

So, before we look at how to do this fast, let's look at how to do it correctly :-)

What we are looking for is a word, that is, a sequence of word characters between word boundaries, followed by one or more repetitions of itself. So let's build a regex for that:

use strict;

my $test = "alpha beta  beta gamma gamma gamma delta";

# /(\b\w+\b)/ matches a single word :
print "Word in string : $1\n" while $test =~ /(\b\w+\b)/g;

# but we want such a word, followed by (whitespace and then) the word again :
# (the trailing \b keeps "the theory" from being treated as "the the")
print "Double word : $1\n" if $test =~ /(\b\w+\b)\s*\1\b/;

# but we not only want to catch duplicates, we also want to catch
# multiple repetitions :
print "Double or more word : $1\n" if $test =~ /(\b\w+\b)(\s*\1\b)+/;

# And since we're throwing it away anyway, there's no need to actually
# capture stuff into $2 :
print "Double or more word : $1\n" if $test =~ /(\b\w+\b)(?:\s*\1\b)+/;

# Now, here's the whole RE to seek out repeating words
# and collapse them into one word.
$test =~ s/(\b\w+\b)(?:\s*\1\b)+/$1/g;

print "$test\n";

I haven't given much thought to optimization yet. A regular expression substitution may well not be the fastest solution, and some plain string scanning might be faster, but since you don't give much context, I can't be of much help in that department.
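For what it's worth, here is a minimal sketch of what such a scan could look like. It assumes that words are simply whitespace-separated tokens and that it is acceptable to normalize the whitespace between them to single spaces. The name collapse_by_split is made up for illustration, and I have not benchmarked it against the regex, so treat it as a starting point rather than as the faster solution.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original question): collapse runs of
# identical whitespace-separated tokens without using a backreference.
# Caveats: whitespace is normalized to single spaces, and "word," and
# "word" count as different tokens because punctuation is not stripped.
sub collapse_by_split {
    my ($text) = @_;
    my @kept;
    for my $word (split ' ', $text) {
        # keep the token only if it differs from the last one we kept
        push @kept, $word unless @kept && $kept[-1] eq $word;
    }
    return join ' ', @kept;
}

print collapse_by_split("alpha beta  beta gamma gamma gamma delta"), "\n";
```

Whether this actually beats the substitution depends on your data, so measure both (the core Benchmark module is handy for that) before committing to either.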

Update: Changed attribution from Kernighan to Knuth


Replies are listed 'Best First'.
Re: Re: most efficient regex to delete duplicate words by jered (Initiate) on Aug 14, 2001 at 06:27 UTC
Sorry, I just wanted to point out that Don Knuth said that, not Brian Kernighan. I know it seems trivial, but to me it's the difference between an idol and a god. Otherwise your post makes perfect sense to me. :)