Premature optimization is the root of all evil (D. Knuth)

So, before we look at how to do this fast, let's look at how to do it correctly :-)

What we are looking for is a word, that is, a sequence of word characters between word boundaries, which is followed by one or more repetitions of itself. So let's build a regular expression for that:

use strict;
use warnings;

my $test = "alpha beta  beta gamma gamma gamma delta";

# /(\b\w+\b)/ matches a single word:
print "Word in string : $1\n" while $test =~ /(\b\w+\b)/g;

# but we want such a word, followed by (whitespace and then) the word again.
# The \b after \1 keeps "the theory" from matching as "the" plus "the":
print "Double word : $1\n" if $test =~ /(\b\w+\b)\s*\1\b/;

# but we not only want to catch duplicates, we also want to catch
# multiple repetitions:
print "Double or more word : $1\n" if $test =~ /(\b\w+\b)(\s*\1\b)+/;

# And since we're throwing it away anyway, there's no need to actually
# capture stuff into $2:
print "Double or more word : $1\n" if $test =~ /(\b\w+\b)(?:\s*\1\b)+/;

# Now, here's the whole RE to seek out repeating words
# and collapse them into one word.
$test =~ s/(\b\w+\b)(?:\s*\1\b)+/$1/g;

print "$test\n";

In this short time, I haven't given much thought to optimization yet. A regular expression replace may well not be the fastest solution, and some plain string scanning might be faster, but as you don't give much context, I can't be of much help in that department.
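For what the "string scanning" idea might look like, here is a minimal sketch that splits on whitespace and keeps a word only when it differs from the previous one. The name collapse_repeats is my own invention for illustration, and I am assuming that words are separated by whitespace only and that normalizing runs of whitespace to a single space is acceptable. I have not measured whether this is actually faster than the substitution.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original thread): collapse runs of the
# same whitespace-separated word into a single occurrence.
sub collapse_repeats {
    my ($text) = @_;
    my @kept;
    # split ' ' drops leading whitespace and splits on runs of whitespace
    for my $word (split ' ', $text) {
        # keep the word unless it equals the one we kept last
        push @kept, $word unless @kept && $kept[-1] eq $word;
    }
    return join ' ', @kept;
}

print collapse_repeats("alpha beta  beta gamma gamma gamma delta"), "\n";
# prints "alpha beta gamma delta"
```

Note that this is not quite the same thing as the regex: punctuation stays attached to a word, so "gamma gamma," is not collapsed here, and the original spacing is lost. If speed really matters, the core Benchmark module can compare the two approaches on your real data.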

Update: Changed attribution from Kernighan to Knuth

In reply to Re: most efficient regex to delete duplicate words by Corion
in thread most efficient regex to delete duplicate words by Anonymous Monk
