most efficient regex to delete duplicate words

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: most efficient regex to delete duplicate words by Corion (Patriarch) on Aug 14, 2001 at 00:17 UTC
Premature optimization is the root of all evil (D. Knuth) So, before we look at how to do this fast, let's look at how to do it correctly :-) What we are looking for is a word, that is, a sequence of characters between word boundaries, which is followed by one or more instances of itself. So let's build a RE for that : use strict; my $test = "alpha beta beta gamma gamma gamma delta"; # /(\b\w+\b)/ matches a single word : print "Word in string : $1\n" while $test =~ /(\b\w+\b)/g; # but we want such a word, followed by (whitespace and then) the word +again : print "Double word : $1\n" if $test =~ /(\b\w+\b)\s\1/; # but we not only want to catch duplicates, we also want to catch # multiple repetitions : print "Double or more word : $1\n" if $test =~ /(\b\w+\b)(\s\1)+/; # And since we're throwing it away anyway, there's no need to actually # capture stuff into $2 : print "Double or more word : $1\n" if $test =~ /(\b\w+\b)(?:\s\1)+/; # Now, here's the whole RE to seek out repeating words # and collapse them into one word. $test =~ s/(\b\w+\b)(?:\s\1)+/$1/g; print $test; [download] In this short time, I didn't yet give much thought about optimization, and I think that a regular expression string replace might possibly not be the fastest solution, some string scanning might be faster, but as you don't give much context, I won't be of much help in that departement. Update: Changed attribution from Kernighan to Knuth	[reply] [d/l]
Re: Re: most efficient regex to delete duplicate words by jered (Initiate) on Aug 14, 2001 at 06:27 UTC
Sorry, I just wanted to point out that Don Knuth said that, not Brian Kernighan. I know it seems trivial, but to me it's the difference between an idol and a god. Otherwise your post makes perfect sense to me. :)	[reply]
Re: most efficient regex to delete duplicate words (boo) by boo_radley (Parson) on Aug 14, 2001 at 00:25 UTC
Here's one way. `$_="alpha beta beta gamma gamma gamma"; while (s/((\w+)\s\2)/$2/) {}; print $_;` [download] the second line does something that may not be obvious to everyone, and seems to duplicate /g 's functionality. However, since you've (seemingly) got 3 gamma in a row, writing `$_="alpha beta beta gamma gamma gamma"; s/((\w+)\s\2)/$2/g; print $_;` [download] Will leave you with an extra gamma. Using the 'useless' while loop allows the regex to check for multiple duplicates. As for the regex you tried `$string =~ s/(\w+)(.)\b\1/$1 $2/sig;`, we have : one or more word characters (\w+) zero or more characters of any class (.) a word boundry the result of the first match The \w+ and the .* don't play nicely together, and this may be what's causing problems.	[reply] [d/l] [select]
Re: Re: most efficient regex to delete duplicate words (boo) by blakem (Monsignor) on Aug 14, 2001 at 02:37 UTC
I always perfer: `1 while (EXPR)` [download] instead of: `while (EXPR) {}` [download] Simply because the '1 while' sticks out at the front where as the empty {} tends to get lost. I find that '1 while' immediately flags this perl idiom and makes it easier for me to pick it out. -Blake	[reply] [d/l] [select]
Re: Re: Re: most efficient regex to delete duplicate words (boo) by Ven'Tatsu (Deacon) on Aug 14, 2001 at 07:32 UTC
`1 while (EXPR):` will also avoid the slight over head of entering and exiting the lexical context created by the BLOCK in `while (EXPR) {}` Perl might be smart enough to optimise the out the empty block though.	[reply] [d/l] [select]
Re: Re: Re: most efficient regex to delete duplicate words (boo) by mugwumpjism (Hermit) on Aug 14, 2001 at 03:50 UTC
What's wrong with: `"chill" while (EXPR);` ???	[reply] [d/l]
Re: most efficient regex to delete duplicate words by maverick (Curate) on Aug 14, 2001 at 00:40 UTC
Here's a non regexp solution. I'm assuming that the 'separator' between the words is unimportant. This will also remove duplicates that are not 'next to' each other within the string. Ie. The regexps here won't remove the second alpha from "alpha beta alpha". `my $string = "alpha beta beta gamma gamma gamma foo bar bar baz qux +qux qux"; # preseving the order my $i; my %h = map { $_ => $i++ } split(/[\b\s]+/,$string); print join(" ", sort { $h{$a} <=> $h{$b} } keys %h); print "\n"; # destroying the order my %h = map { $_ => 1 } split(/[\b\s]+/,$string); print join(" ",keys %h); print "\n";` [download] Yields: `alpha beta gamma foo bar baz qux foo gamma baz bar beta alpha qux` [download] /\/\averick perl -l -e "eval pack('h*','072796e6470272f2c5f2c5166756279636b672');"	[reply] [d/l] [select]
Re: most efficient regex to delete duplicate words by japhy (Canon) on Aug 14, 2001 at 07:31 UTC
If you mean consecutive words, then use something like the following: `$word = qr{ \w [\w'-]* }x; $nonword = qr{ [^\w'-]+ }x; $text =~ s{ \b ($word) (?: $nonword \1 )+ (?! \w ) # UPDATE }{$1}xg;` [download] Very simply, it finds a word followed by some non-word thing, followed by the same word (one or more times), and replaces it with the original word. _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l]
Re: most efficient regex to delete duplicate words by petdance (Parson) on Aug 14, 2001 at 05:18 UTC
Two points: I wouldn't worry about the efficiency until you get it working right. Doesn't matter how slow it is if it's incorrect. Why does it have to be a regex? `#!/usr/bin/perl -w use strict; my $test = "alpha beta beta gamma gamma gamma delta"; my %seen; my @out; for ( split( " ", $test ) ) { push( @out, $_ ) unless $seen{$_}++; } print join( " ", @out ), "\n";` [download] This solution keeps the order, dedupes throughout the entire string, and doesn't use any regexes, and I'd bet it's a good deal quicker than any regex solution. Of course, you'll want to measure if you want to be sure. xoxo, Andy -- <megaphone> Throw down the gun and tiara and come out of the float! </megaphone>	[reply] [d/l]
Re: most efficient regex to delete duplicate words by runrig (Abbot) on Aug 14, 2001 at 01:42 UTC
This will scan the string once, with no backtracking: `# Assuming space delimited words and removing duplicates # not necessarily next to each other my $str = "abc def abc hello def 123 abc"; my %count; $str =~ s/((\w+)\s?)/$count{$2}++ ? '' : $1/eg; print $str,"\n";` [download]	[reply] [d/l]
Re: Re: most efficient regex to delete duplicate words by coolmichael (Deacon) on Aug 14, 2001 at 06:44 UTC
You need to be careful with that one, because it will mess with words that are repeated but aren't consecutive. It isn't clear if the poster wanted to trash only repeated consecutive words or all repeated words. Michael. Update:Right in front of me, and I still didn't read it right. Sorry runrig	[reply]
Re: most efficient regex to delete duplicate words by Anonymous Monk on Aug 14, 2001 at 00:33 UTC
A simple start. $string =~ s/(\w+)(\W\1)*/\1/g; DelPapa	[reply]
Re: Re: most efficient regex to delete duplicate words by coolmichael (Deacon) on Aug 14, 2001 at 06:41 UTC
I'm still in the process of learning regexes, but I would think that `$strint=~s/(\w+)(?:\W\1)+/\1/g;` would be a little faster, as you don't need to store the second parenthesis match thingy, and you don't need to match zero or more copies of the first match, you need to match one or more. Can someone please correct me or confirm this thought? Thanks. Michael PS: I've also been thinking about the /o modifier. Would this be a good place to use it?	[reply] [d/l]
Re: Re: Re: most efficient regex to delete duplicate words by blakem (Monsignor) on Aug 14, 2001 at 11:00 UTC
PS: I've also been thinking about the /o modifier. Would this be a good place to use it? $strint=~s/(\w+)(?:\W\1)+/\1/g; Nope. And heres why. The /o modifier relates specifically to interpolated variables inside of the regex. For instance, the regex: `m/$matchvar/` Is a candidate for the /o modifier, since it will be built from the value of $matchvar. With the /o modifier, the resulting "regex engine" will be cached for future use. The next time we run across this regex, we'll use the same engine, even if $matchvar has changed. The common analogy is that the /o modifier is like a promise. You promise not to change any of the variables in the regex, in return for better performance. If you break your promise, your program will break as well. Your original regex has no interpolated variables, so the regex is only compiled once and we use that copy everytime regardless of the /o modifier. In other words, the /o modifier would be redundant on your regex. -Blake	[reply] [d/l]
Re: most efficient regex to delete duplicate words by jehuni (Pilgrim) on Aug 14, 2001 at 14:40 UTC
Don't forget to watch out for phrases like "Get the theist!" or "He would've vented had he had the time." Some of the above regexes would improperly remove the seemingly duplicated words ("the" and "ve") from the middles of those phrases. You can avoid this problem with a positive lookahead (props to japhy for the compiled word and nonword regexes): `$word = qr/ \w [\w'-]* /x; $nonword = qr/ [^\w'-]+ /x; $text =~ s/ \b ($word) $nonword (?= \1 $nonword ) //xg;` [download] This looks ahead to make sure that second and seemingly duplicated word is actually a separate word, and not part of another one. You can't simply use a \b at the end, since that would still mishandle phrases like "The can can't cant?" On the other hand, I have no idea whether this is faster or more efficient. Plus, you'll still have to watch out for intentionally duplicated words, as in "He didn't know that that regex was going to do so much damage." -jehuni	[reply] [d/l]
Re: most efficient regex to delete duplicate words by Anonymous Monk on Aug 14, 2001 at 14:44 UTC
i think you can try using associative array for this: `@arr1 = qw(alpha beta beta gamma gamma gamma); undef %arr2; @arr2{@arr1} = (); @arr1 = keys(%arr2);` [download] @arr1 will now contain "alpha beta gamma". posted by ravi.singh@seepz.tcs.co.in Edit Masem 2001-08-15 - Added code tags.	[reply] [d/l]