Premature optimization is the root of all evil (D. Knuth)

So, before we look at how to do this fast, let's look at how to do it correctly :-)

What we are looking for is a word, that is, a sequence of characters between word boundaries, which is followed by one or more instances of itself. So let's build an RE for that:

use strict;

my $test = "alpha beta beta gamma gamma gamma delta";

# /(\b\w+\b)/ matches a single word :
print "Word in string : $1\n" while $test =~ /(\b\w+\b)/g;

# but we want such a word, followed by (whitespace and then) the word again :
print "Double word : $1\n" if $test =~ /(\b\w+\b)\s*\1/;

# but we not only want to catch duplicates, we also want to catch
# multiple repetitions :
print "Double or more word : $1\n" if $test =~ /(\b\w+\b)(\s*\1)+/;

# And since we're throwing it away anyway, there's no need to actually
# capture stuff into $2 :
print "Double or more word : $1\n" if $test =~ /(\b\w+\b)(?:\s*\1)+/;

# Now, here's the whole RE to seek out repeating words
# and collapse them into one word.
$test =~ s/(\b\w+\b)(?:\s*\1)+/$1/g;
print $test;
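One caveat worth checking against your own data: since nothing anchors the end of the backreference, the final substitution can also match when the repeated text is only a prefix of the next word, so "the theme" would collapse to "theme". A minimal sketch of a stricter variant follows, assuming that repeats are always separated by at least one whitespace character. The helper name `collapse_repeats` is hypothetical and only there for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): collapse runs of a
# repeated word into a single occurrence.
sub collapse_repeats {
    my ($text) = @_;
    # The \b after \1 stops the backreference from matching a mere prefix
    # of the following word; \s+ requires real whitespace between repeats.
    $text =~ s/\b(\w+)(?:\s+\1\b)+/$1/g;
    return $text;
}

print collapse_repeats("alpha beta beta gamma gamma gamma delta"), "\n";
print collapse_repeats("the theme"), "\n";
```

With this variant the first line prints "alpha beta gamma delta" and the second input is left untouched as "the theme".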

I haven't given much thought to optimization yet. A regular-expression substitution may well not be the fastest solution, and some hand-written string scanning might be faster, but since you don't give much context, I can't be of much help in that department.
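For what such string scanning could look like, here is a rough sketch that splits on whitespace and compares each word with the previously kept one. It is not equivalent to the regex: it treats punctuation as part of a word and normalizes all whitespace to single spaces, and whether it is actually faster would have to be measured on real input (for example with the core Benchmark module). The helper name `collapse_by_scan` is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): drop a word when it
# equals the word kept immediately before it.
sub collapse_by_scan {
    my ($text) = @_;
    my @kept;
    # split ' ' splits on runs of whitespace and skips leading whitespace.
    for my $word (split ' ', $text) {
        push @kept, $word unless @kept && $kept[-1] eq $word;
    }
    return join ' ', @kept;
}

print collapse_by_scan("alpha beta beta gamma gamma gamma delta"), "\n";
```

For the test string used above this prints "alpha beta gamma delta".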

Update: Changed attribution from Kernighan to Knuth


In reply to Re: most efficient regex to delete duplicate words by Corion
in thread most efficient regex to delete duplicate words by Anonymous Monk
