Don't forget to watch out for phrases like "Get the theist!" or "He would've vented had he had the time." Some of the above regexes would improperly remove the seemingly duplicated words ("the" and "ve") from the middles of those phrases.
You can avoid this problem with a positive lookahead (props to japhy for the compiled word and nonword regexes):
    $word    = qr/ \w [\w'-]* /x;
    $nonword = qr/ [^\w'-]+ /x;
    $text =~ s/ \b ($word) $nonword (?= \1 $nonword ) //xg;
The lookahead makes sure that the second, seemingly duplicated word is actually a separate word, and not the start of a longer one. You can't simply use a \b at the end, since that would still mishandle phrases like "The can can't cant?" (there is a word boundary between "can" and the apostrophe in "can't").
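To see the lookahead at work, here is a minimal sketch that wraps the substitution in a sub and runs it over the tricky phrases from this post. The sub name remove_dups and the "Paris in the the spring." test phrase are my own additions for illustration, not part of the original code.

```perl
use strict;
use warnings;

# Same word and nonword patterns as in the post above.
my $word    = qr/ \w [\w'-]* /x;
my $nonword = qr/ [^\w'-]+ /x;

# remove_dups is a hypothetical helper name, used here only for illustration.
sub remove_dups {
    my ($text) = @_;
    # Drop a word (plus the separator after it) only when the very same
    # word follows as a complete word of its own.
    $text =~ s/ \b ($word) $nonword (?= \1 $nonword ) //xg;
    return $text;
}

for my $phrase (
    "Get the theist!",
    "He would've vented had he had the time.",
    "The can can't cant?",
    "Paris in the the spring.",
) {
    printf "%-42s => %s\n", $phrase, remove_dups($phrase);
}
```

The first three phrases should come through unchanged, and only the last one should lose its extra "the". One thing to be aware of: because the lookahead insists on a trailing $nonword, a duplicate at the very end of the string (with no punctuation after it) is not removed.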
On the other hand, I have no idea whether this is faster or more efficient. Plus, you'll still have to watch out for intentionally duplicated words, as in "He didn't know that that regex was going to do so much damage."
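One possible way to cope with intentional duplicates is an allow-list of words that are permitted to appear twice in a row. The sketch below is only an assumption about how you might do it: the %allow_double hash, its contents, and the sub name remove_dups_except are all hypothetical. It captures the separator as well, so an allowed pair can be put back untouched.

```perl
use strict;
use warnings;

my $word    = qr/ \w [\w'-]* /x;
my $nonword = qr/ [^\w'-]+ /x;

# Hypothetical allow-list of words that may legitimately be doubled.
my %allow_double = map { $_ => 1 } qw(that had);

# remove_dups_except is an illustrative name, not from the original post.
sub remove_dups_except {
    my ($text) = @_;
    # $1 is the word, $2 the separator; /e evaluates the replacement as code,
    # restoring the match for allowed words and deleting it otherwise.
    $text =~ s{ \b ($word) ($nonword) (?= \1 $nonword ) }
              { $allow_double{$1} ? "$1$2" : "" }xge;
    return $text;
}

print remove_dups_except("He didn't know that that regex was going to do so much damage."), "\n";
print remove_dups_except("It did so so much damage."), "\n";
```

The trade-off is that a genuinely accidental "that that" will now slip through too, so this only shifts the problem rather than solving it.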
-jehuni
In reply to Re: most efficient regex to delete duplicate words
by jehuni
in thread most efficient regex to delete duplicate words
by Anonymous Monk