Re: most efficient regex to delete duplicate words
by Corion (Patriarch) on Aug 14, 2001 at 00:17 UTC
|
Premature optimization is the root of all evil (D. Knuth)
So, before we look at how to do this fast, let's look at how to do it correctly :-)
What we are looking for is a word, that is, a sequence of characters between word boundaries, which is followed by one or more instances of itself. So let's build a RE for that :
use strict;
my $test = "alpha beta beta gamma gamma gamma delta";
# /(\b\w+\b)/ matches a single word :
print "Word in string : $1\n" while $test =~ /(\b\w+\b)/g;
# but we want such a word, followed by (whitespace and then) the word
+again :
print "Double word : $1\n" if $test =~ /(\b\w+\b)\s*\1/;
# but we not only want to catch duplicates, we also want to catch
# multiple repetitions :
print "Double or more word : $1\n" if $test =~ /(\b\w+\b)(\s*\1)+/;
# And since we're throwing it away anyway, there's no need to actually
# capture stuff into $2 :
print "Double or more word : $1\n" if $test =~ /(\b\w+\b)(?:\s*\1)+/;
# Now, here's the whole RE to seek out repeating words
# and collapse them into one word.
$test =~ s/(\b\w+\b)(?:\s*\1)+/$1/g;
print $test;
In this short time, I didn't yet give much thought about optimization, and I think that a regular expression string replace might possibly not be the fastest solution, some string scanning might be faster, but as you don't give much context, I won't be of much help in that departement.
Update: Changed attribution from Kernighan to Knuth
| [reply] [d/l] |
|
|
Sorry, I just wanted to point out that Don Knuth said that, not Brian Kernighan. I know it seems trivial, but to me it's the difference between an idol and a god. Otherwise your post makes perfect sense to me. :)
| [reply] |
Re: most efficient regex to delete duplicate words (boo)
by boo_radley (Parson) on Aug 14, 2001 at 00:25 UTC
|
$_="alpha beta beta gamma gamma gamma";
while (s/((\w+)\s\2)/$2/) {};
print $_;
the second line does something that may not be obvious to everyone, and seems to duplicate /g 's functionality. However, since you've (seemingly) got 3 gamma in a row, writing
$_="alpha beta beta gamma gamma gamma";
s/((\w+)\s\2)/$2/g;
print $_;
Will leave you with an extra gamma. Using the 'useless' while loop allows the regex to check for multiple duplicates.
As for the regex you tried $string =~ s/(\w+)(.*)\b\1/$1 $2/sig;, we have :
- one or more word characters (\w+)
- zero or more characters of any class (.*)
- a word boundry
- the result of the first match
The \w+ and the .* don't play nicely together, and this may be what's causing problems. | [reply] [d/l] [select] |
|
|
1 while (EXPR)
instead of:
while (EXPR) {}
Simply because the '1 while' sticks out at the front where as the empty {} tends to get lost. I find that '1 while' immediately flags this perl idiom and makes it easier
for me to pick it out.
-Blake | [reply] [d/l] [select] |
|
|
1 while (EXPR):
will also avoid the slight over head of entering and exiting the lexical context created by the BLOCK in
while (EXPR) {}
Perl might be smart enough to optimise the out the empty block though.
| [reply] [d/l] [select] |
|
|
What's wrong with:
"chill" while (EXPR);
???
| [reply] [d/l] |
Re: most efficient regex to delete duplicate words
by maverick (Curate) on Aug 14, 2001 at 00:40 UTC
|
Here's a non regexp solution. I'm assuming that the 'separator' between the words is unimportant. This will also remove duplicates that are not 'next to' each other within the string. Ie. The regexps here won't remove the second alpha from "alpha beta alpha".
my $string = "alpha beta beta gamma gamma gamma foo bar bar baz qux
+qux qux";
# preseving the order
my $i;
my %h = map { $_ => $i++ } split(/[\b\s]+/,$string);
print join(" ", sort { $h{$a} <=> $h{$b} } keys %h);
print "\n";
# destroying the order
my %h = map { $_ => 1 } split(/[\b\s]+/,$string);
print join(" ",keys %h);
print "\n";
Yields:
alpha beta gamma foo bar baz qux
foo gamma baz bar beta alpha qux
/\/\averick
perl -l -e "eval pack('h*','072796e6470272f2c5f2c5166756279636b672');"
| [reply] [d/l] [select] |
Re: most efficient regex to delete duplicate words
by japhy (Canon) on Aug 14, 2001 at 07:31 UTC
|
If you mean consecutive words, then use something like the following:
$word = qr{ \w [\w'-]* }x;
$nonword = qr{ [^\w'-]+ }x;
$text =~ s{
\b
($word)
(?: $nonword \1 )+
(?! \w ) # UPDATE
}{$1}xg;
Very simply, it finds a word followed by some non-word thing, followed by the same word (one or more times), and replaces it with the original word.
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??; | [reply] [d/l] |
Re: most efficient regex to delete duplicate words
by petdance (Parson) on Aug 14, 2001 at 05:18 UTC
|
Two points:
- I wouldn't worry about the efficiency until you get
it working right. Doesn't matter how slow it is if it's incorrect.
- Why does it have to be a regex?
#!/usr/bin/perl -w
use strict;
my $test = "alpha beta beta gamma gamma gamma delta";
my %seen;
my @out;
for ( split( " ", $test ) ) {
push( @out, $_ ) unless $seen{$_}++;
}
print join( " ", @out ), "\n";
This solution keeps the order, dedupes throughout the entire string, and doesn't use any regexes, and I'd bet it's a good deal quicker than any regex solution. Of course, you'll want to measure if you want to be sure.
xoxo,
Andy
--
<megaphone>
Throw down the gun and tiara and come out of the float!
</megaphone>
| [reply] [d/l] |
Re: most efficient regex to delete duplicate words
by runrig (Abbot) on Aug 14, 2001 at 01:42 UTC
|
This will scan the string once, with no backtracking: # Assuming space delimited words and removing duplicates
# not necessarily next to each other
my $str = "abc def abc hello def 123 abc";
my %count;
$str =~ s/((\w+)\s?)/$count{$2}++ ? '' : $1/eg;
print $str,"\n";
| [reply] [d/l] |
|
|
| [reply] |
Re: most efficient regex to delete duplicate words
by Anonymous Monk on Aug 14, 2001 at 00:33 UTC
|
A simple start.
$string =~ s/(\w+)(\W\1)*/\1/g;
DelPapa | [reply] |
|
|
I'm still in the process of learning regexes, but I would think that
$strint=~s/(\w+)(?:\W\1)+/\1/g;
would be a little faster, as you don't need to store the second parenthesis match thingy, and you don't need to match zero or more copies of the first match, you need to match one or more. Can someone please correct me or confirm this thought? Thanks.
Michael
PS: I've also been thinking about the /o modifier. Would this be a good place to use it?
| [reply] [d/l] |
|
|
PS: I've also been thinking about the /o modifier. Would this be a good place to use it?
$strint=~s/(\w+)(?:\W\1)+/\1/g;
Nope. And heres why.
The /o modifier relates specifically to interpolated variables inside of the regex. For instance, the regex:
m/$matchvar/
Is a candidate for the /o modifier, since it will be built from the value of $matchvar. With the /o modifier, the resulting "regex engine" will be cached for future use. The next time we run across this regex, we'll use the same engine, even if $matchvar has changed.
The common analogy is that the /o modifier is like a promise.
You promise not to change any of the variables in the regex, in return for better performance. If you
break your promise, your program will break as well.
Your original regex has no interpolated variables, so the regex
is only compiled once and we use that copy everytime regardless of the /o modifier.
In other words, the /o modifier would be redundant on your regex.
-Blake
| [reply] [d/l] |
Re: most efficient regex to delete duplicate words
by jehuni (Pilgrim) on Aug 14, 2001 at 14:40 UTC
|
Don't forget to watch out for phrases like "Get the theist!" or "He would've vented had he had the time." Some of the above regexes would improperly remove the seemingly duplicated words ("the" and "ve") from the middles of those phrases.
You can avoid this problem with a positive lookahead (props to japhy for the compiled word and nonword regexes):
$word = qr/ \w [\w'-]* /x;
$nonword = qr/ [^\w'-]+ /x;
$text =~ s/ \b ($word) $nonword (?= \1 $nonword ) //xg;
This looks ahead to make sure that second and seemingly duplicated word is actually a separate word, and not part of another one. You can't simply use a \b at the end, since that would still mishandle phrases like "The can can't cant?"
On the other hand, I have no idea whether this is faster or more efficient. Plus, you'll still have to watch out for intentionally duplicated words, as in "He didn't know that that regex was going to do so much damage."
-jehuni
| [reply] [d/l] |
Re: most efficient regex to delete duplicate words
by Anonymous Monk on Aug 14, 2001 at 14:44 UTC
|
i think you can try using associative array for this:
@arr1 = qw(alpha beta beta gamma gamma gamma);
undef %arr2;
@arr2{@arr1} = ();
@arr1 = keys(%arr2);
@arr1 will now contain "alpha beta gamma".
posted by
ravi.singh@seepz.tcs.co.in
Edit Masem 2001-08-15 - Added code tags. | [reply] [d/l] |