in reply to How do I check a string for dupicate text?

1) If you expect this to be the entire field:

$len = length($field);
$field1 = substr($field, 0, int($len/2));
$field2 = substr($field, -int($len/2));
$field = $field1 if ($field1 eq $field2);

2) Handles spaces better:

$field =~ s/^(.+)\s*\1/$1/;

3) Handles duplicates anywhere in the field:

$field =~ s/(.{2,})\s*\1/$1/g;

Update: 4) Handles duplicates anywhere in the field, stopping at word boundaries:

$field =~ s/\b(.+)\b\s*\1\b/$1/g;

Test cases for all four follow:

sub test1 {
    my $len = length($_[0]);
    my $part1 = substr($_[0], 0, int($len/2));
    my $part2 = substr($_[0], -int($len/2));
    $_[0] = $part1 if ($part1 eq $part2);
}

sub test2 {
    $_[0] =~ s/^(.+)\s*\1/$1/;
}

sub test3 {
    $_[0] =~ s/(.{2,})\s*\1/$1/g;
}

sub test4 {
    $_[0] =~ s/\b(.+)\b\s*\1\b/$1/g;
}

foreach $test (qw( test1 test2 test3 test4 )) {
    print($test, "\n");

    foreach (
        'John SmithJohn Smith',
        'John Smith John Smith',
        'John Smith  John Smith',    # two spaces between the copies
        'foo John Smith John Smith bar',
        'John Johnson',
        'foo John Johnson bar',
        'John Smith!John Smith',
    ) {
        my $field = $_;
        &$test($field);    # $test holds the sub's name (symbolic reference)
        print($field, "\n");
    }

    print("\n");
}

__END__

output
======
test1
John Smith
John Smith
John Smith  John Smith <-- case not covered
foo John Smith John Smith bar <-- case not covered
John Johnson
foo John Johnson bar <-- case not covered
John Smith <-- slightly buggy

test2
John Smith
John Smith
John Smith
foo John Smith John Smith bar <-- case not covered
Johnson <-- buggy
foo John Johnson bar <-- case not covered
John Smith!John Smith

test3
John Smith
John Smith
John Smith
foo John Smith bar
Johnson <-- buggy
foo Johnson bar <-- buggy
John Smith!John Smith

test4
John Smith
John Smith
John Smith
foo John Smith bar
John Johnson
foo John Johnson bar
John Smith!John Smith

Re^2: How do I check a string for dupicate text?
by Not_a_Number (Prior) on Sep 09, 2004 at 17:23 UTC
    s/(.{2,})\s*\1/$1/g

    Beware if you use this!

    Anybody called 'John Johnson' or 'Jo Jones' will lose their first name.
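
To see why this happens, it may help to run the regex on a few names. In the sketch below, `dedupe_anywhere` is a made-up wrapper name used only for illustration. The capture `(.{2,})` is free to match "Jo" or "John" at the start of the first word, `\s*` eats the space, and `\1` then matches the same letters at the start of the surname, so the first word is treated as the duplicate and dropped.

```perl
use strict;
use warnings;

# Hypothetical wrapper around regex (3) from the parent post.
sub dedupe_anywhere {
    my ($field) = @_;
    # Any run of 2+ characters, optional whitespace, then the same run again.
    $field =~ s/(.{2,})\s*\1/$1/g;
    return $field;
}

for my $name ('Jo Jones', 'John Johnson', 'John Smith John Smith') {
    printf "%-22s => %s\n", $name, dedupe_anywhere($name);
}
```

The first two names come back as just the surname, while the genuine duplicate is collapsed as intended.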

      Added (4) which fixes this up.
        You should still require 2 words; otherwise poor Johnson Johnson will suffer.
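
One possible way to require two words is sketched below. This is not code from the thread: the name `dedupe_two_words` and the exact pattern are assumptions, and it treats a "word" as a run of `\w` characters, so names containing apostrophes or hyphens would not be deduplicated.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse a repeated run of two or more
# whitespace-separated words.
sub dedupe_two_words {
    my ($field) = @_;
    # \w+(?:\s+\w+)+  at least two words, so a lone repeated word is left alone
    # \s*\1\b         the same run again, ending on a word boundary
    $field =~ s/\b(\w+(?:\s+\w+)+)\s*\1\b/$1/g;
    return $field;
}

for my $name ('John Smith John Smith', 'John SmithJohn Smith',
              'Johnson Johnson', 'John Johnson') {
    printf "%-22s => %s\n", $name, dedupe_two_words($name);
}
```

With this pattern the two "John Smith" variants collapse to a single copy, while "Johnson Johnson" and "John Johnson" are left untouched.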
        Oh, hey, that's cool. I already handed the script over, but if there's a huge issue with Jo Jones or John Johnson losing their first names, I can just send my client a fixed version of the script.

        He's already run the thing, and it's a huge success. :-)

        Dev Goddess
        Developer / Analyst / Criminal Mastermind

        "Size doesn't matter. It's all about speed and performance."

        $field =~ s/\b(.+)\b\s*\1\b/$1/g;

        Beware! Depending on what else the string contains besides the (possibly repeated) first name + last name, this regex can be highly dangerous.

        Consider:

        my $field = "Jo Doe (Tel: 999-111-111)";
        $field =~ s/\b(.+)\b\s*\1\b/$1/g;
        print $field;

        Oops! What's happened to Jo Doe's phone number??
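
The regex matches "-111-111" (word boundaries sit on both sides of each hyphen) and replaces it with "-111", so the field comes back as "Jo Doe (Tel: 999-111)". If the duplicate is expected to be the entire field, as in approach 1 of the parent post, one way to limit the damage might be to anchor the pattern at both ends. The sketch below is an assumption along those lines, not something proposed in the thread, and `dedupe_whole_field` is a made-up name.

```perl
use strict;
use warnings;

# Regex (4) from the thread, wrapped for comparison.
sub dedupe_word_boundary {
    my ($field) = @_;
    $field =~ s/\b(.+)\b\s*\1\b/$1/g;
    return $field;
}

# Hypothetical alternative: only collapse when the whole field is
# "X", optional whitespace, then "X" again.
sub dedupe_whole_field {
    my ($field) = @_;
    $field =~ s/^(.+?)\s*\1$/$1/;
    return $field;
}

for my $field ('Jo Doe (Tel: 999-111-111)', 'John Smith John Smith', 'John Johnson') {
    printf "%-26s | %-22s | %s\n",
        $field, dedupe_word_boundary($field), dedupe_whole_field($field);
}
```

The trade-off is that the anchored version no longer finds a duplicate buried inside a longer field, and a field that is legitimately one word repeated (such as "Johnson Johnson") is still collapsed.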

Re^2: How do I check a string for dupicate text?
by devgoddess (Acolyte) on Sep 09, 2004 at 16:53 UTC
    OMG! Holy crap! That last line worked perfectly. You're my new hero! *LOL*
