in reply to most efficient regex to delete duplicate words
Premature optimization is the root of all evil (D. Knuth)
So, before we look at how to do this fast, let's look at how to do it correctly :-)
What we are looking for is a word, that is, a sequence of word characters between word boundaries, which is followed by one or more instances of itself. So let's build a RE for that:
    use strict;
    use warnings;

    my $test = "alpha beta beta gamma gamma gamma delta";

    # /(\b\w+\b)/ matches a single word :
    print "Word in string : $1\n" while $test =~ /(\b\w+\b)/g;

    # but we want such a word, followed by (whitespace and then) the word again.
    # The trailing \b keeps "the theory" from matching as "the the" :
    print "Double word : $1\n" if $test =~ /(\b\w+\b)\s*\1\b/;

    # but we not only want to catch duplicates, we also want to catch
    # multiple repetitions :
    print "Double or more word : $1\n" if $test =~ /(\b\w+\b)(\s*\1\b)+/;

    # And since we're throwing it away anyway, there's no need to actually
    # capture stuff into $2 :
    print "Double or more word : $1\n" if $test =~ /(\b\w+\b)(?:\s*\1\b)+/;

    # Now, here's the whole RE to seek out repeating words
    # and collapse them into one word.
    $test =~ s/(\b\w+\b)(?:\s*\1\b)+/$1/g;
    print "$test\n";
In this short time, I haven't given much thought to optimization. A regular expression substitution may well not be the fastest solution, and some plain string scanning might be faster, but as you don't give much context, I can't be of much help in that department.
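For comparison, here is a minimal sketch of what such a string-scanning approach could look like. The helper name dedupe_words is hypothetical, and the sketch assumes that words are whitespace-separated tokens (so "beta," and "beta" count as different words) and that collapsing every run of whitespace to a single space is acceptable. Whether it is actually faster than the substitution is not something I have measured.

```perl
use strict;
use warnings;

# Hypothetical helper (an illustration only): collapse runs of identical
# adjacent words by scanning the token list, without a backreference.
sub dedupe_words {
    my ($text) = @_;
    my @kept;
    for my $word (split ' ', $text) {
        # Keep a word only if it differs from the word kept just before it.
        push @kept, $word unless @kept && $kept[-1] eq $word;
    }
    # Assumption: single spaces between words are fine in the result.
    return join ' ', @kept;
}

print dedupe_words("alpha beta beta gamma gamma gamma delta"), "\n";
# prints "alpha beta gamma delta"
```

Timing both variants on realistic input, for example with the core Benchmark module, would be the way to find out which one wins for your data.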
Update: Changed attribution from Kernighan to Knuth
Re: Re: most efficient regex to delete duplicate words
by jered (Initiate) on Aug 14, 2001 at 06:27 UTC