Re: How do I check a string for duplicate text?

1) If you expect the duplicated text to be the entire field:

$len = length($field);
$field1 = substr($field, 0, int($len/2));
$field2 = substr($field, -int($len/2));
$field = $field1 if ($field1 eq $field2);

2) Handles spaces better:

$field =~ s/^(.+)\s*\1/$1/;

3) Handles duplicates anywhere in the field:

$field =~ s/(.{2,})\s*\1/$1/g;

Update: 4) Handles duplicates anywhere in the field, but only matches on word boundaries:

$field =~ s/\b(.+)\b\s*\1\b/$1/g;
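
For readers less used to backreferences, the same substitution can be spread out with the /x modifier so that each piece carries a comment. This is only a sketch of regex (4) in a more readable layout, not a different technique; the helper name dedupe_words is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the same regex as (4), laid out with /x.
sub dedupe_words {
    my ($field) = @_;
    $field =~ s/
        \b (.+) \b   # capture a run that starts and ends on a word boundary
        \s*          # optional whitespace between the two copies
        \1           # the captured text again (backreference)
        \b           # the repeat must also end on a word boundary
    /$1/gx;
    return $field;
}

print dedupe_words('foo John Smith John Smith bar'), "\n";   # foo John Smith bar
print dedupe_words('John Johnson'), "\n";                    # John Johnson
```

The final \b is what keeps 'John Johnson' intact: the second "John" is followed by "s", so it does not end on a word boundary and the match is rejected.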

Test cases for all four follow:

sub test1 {
   my $len = length($_[0]);
   my $part1 = substr($_[0], 0, int($len/2));
   my $part2 = substr($_[0], -int($len/2));
   $_[0] = $part1 if ($part1 eq $part2);
}

sub test2 {
   $_[0] =~ s/^(.+)\s*\1/$1/;
}

sub test3 {
   $_[0] =~ s/(.{2,})\s*\1/$1/g;
}

sub test4 {
   $_[0] =~ s/\b(.+)\b\s*\1\b/$1/g;
}

foreach $test (qw( test1 test2 test3 test4 )) {
   print($test, "\n");

   foreach (
      'John SmithJohn Smith',
      'John Smith John Smith',
      'John Smith  John Smith',
      'foo John Smith John Smith bar',
      'John Johnson',
      'foo John Johnson bar',
      'John Smith!John Smith',
   ) {
      my $field = $_;
      &$test($field);
      print($field, "\n");
   }

   print("\n");
}

__END__
output
======
test1
John Smith
John Smith
John Smith  John Smith         <-- case not covered
foo John Smith John Smith bar  <-- case not covered
John Johnson
foo John Johnson bar           <-- case not covered
John Smith                     <-- slightly buggy

test2
John Smith
John Smith
John Smith
foo John Smith John Smith bar  <-- case not covered
Johnson                        <-- buggy
foo John Johnson bar           <-- case not covered
John Smith!John Smith

test3
John Smith
John Smith
John Smith
foo John Smith bar
Johnson                        <-- buggy
foo Johnson bar                <-- buggy
John Smith!John Smith

test4
John SmithJohn Smith           <-- case not covered
John Smith
John Smith
foo John Smith bar
John Johnson
foo John Johnson bar
John Smith!John Smith

Re^2: How do I check a string for duplicate text? by Not_a_Number (Prior) on Sep 09, 2004 at 17:23 UTC
s/(.{2,})\s*\1/$1/g

Beware if you use this! Anybody called 'John Johnson' or 'Jo Jones' will lose their first name.
Re^3: How do I check a string for duplicate text? by ikegami (Patriarch) on Sep 09, 2004 at 17:37 UTC
Added (4), which fixes this.
Re^4: How do I check a string for duplicate text? by ysth (Canon) on Sep 09, 2004 at 19:06 UTC
You should still require two words; otherwise poor Johnson Johnson will suffer.
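
One way the two-word requirement might look is sketched below. It is an assumption, not code from the thread: the capture in (4) is narrowed from `.+` to two or more whitespace-separated runs of `\w` characters, and the helper name dedupe_full_names is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: like (4), but the captured text
# must contain at least two whitespace-separated words.
sub dedupe_full_names {
    my ($field) = @_;
    $field =~ s/\b(\w+(?:\s+\w+)+)\b\s*\1\b/$1/g;
    return $field;
}

for my $name ('Johnson Johnson', 'John Smith John Smith', 'John Johnson') {
    printf "%-22s => %s\n", $name, dedupe_full_names($name);
}
```

Under these assumptions 'Johnson Johnson' and 'John Johnson' are left alone, while 'John Smith John Smith' collapses to 'John Smith'. The trade-off is that a genuinely duplicated single word is no longer removed, and names containing apostrophes or hyphens are not covered by `\w`.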
Re^5: How do I check a string for duplicate text? by ikegami (Patriarch) on Sep 09, 2004 at 19:52 UTC
Re^6: How do I check a string for duplicate text? by davido (Cardinal) on Sep 10, 2004 at 05:49 UTC
Re^4: How do I check a string for duplicate text? by devgoddess (Acolyte) on Sep 09, 2004 at 18:33 UTC
Oh, hey, that's cool. I already handed the script over, but if there's a huge issue with Jo Jones or John Johnson losing their first names, I can just send my client a fixed version of the script. He's already run the thing, and it's a huge success. :-)
Dev Goddess, Developer / Analyst / Criminal Mastermind. "Size doesn't matter. It's all about speed and performance."
Re^4: How do I check a string for duplicate text? by Not_a_Number (Prior) on Sep 11, 2004 at 19:02 UTC
$field =~ s/\b(.+)\b\s*\1\b/$1/g;

Beware! Depending on what else the string contains apart from the (possibly repeated) first name + last name, this regex can be highly dangerous. Consider:

my $field = "Jo Doe (Tel: 999-111-111)";
$field =~ s/\b(.+)\b\s*\1\b/$1/g;
print $field;

Oops! What's happened to Jo Doe's phone number?
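
To make the damage concrete, the sketch below runs pattern (4) on that field and compares it with a stricter variant. The stricter pattern is an assumption added for illustration, not something proposed in the thread: it only collapses repeats made of ASCII letters and whitespace, so digits and punctuation are never touched.

```perl
use strict;
use warnings;

my $field = 'Jo Doe (Tel: 999-111-111)';

# Pattern (4) as posted: the capture may be any text, digits and punctuation included.
# Here it captures "-111", sees it repeated, and drops one copy.
(my $loose = $field) =~ s/\b(.+)\b\s*\1\b/$1/g;
print "$loose\n";    # Jo Doe (Tel: 999-111)

# Assumed mitigation: only collapse repeats made of ASCII letters and whitespace.
(my $strict = $field) =~ s/\b([A-Za-z]+(?:\s+[A-Za-z]+)*)\b\s*\1\b/$1/g;
print "$strict\n";   # Jo Doe (Tel: 999-111-111)
```

Restricting the capture to `[A-Za-z]` is deliberately conservative; accented or non-Latin names would need a wider character class.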
Re^5: How do I check a string for duplicate text? by ikegami (Patriarch) on Sep 11, 2004 at 19:48 UTC
Re^2: How do I check a string for duplicate text? by devgoddess (Acolyte) on Sep 09, 2004 at 16:53 UTC
OMG! Holy crap! That last line worked perfectly. You're my new hero! LOL
Dev Goddess, Developer / Analyst / Criminal Mastermind. "Size doesn't matter. It's all about speed and performance."
