devgoddess has asked for the wisdom of the Perl Monks concerning the following question:

I have what I feel is the stupidest question. I'm working with name fields in a flat file. Most of them are fine, but some have doubled names. For example, a field may contain "John Smith John Smith". I need to take the first 2 words of the string, "John Smith" (the first and last name), and check the rest of the string for another occurrence of that name. Then I need to chop off the 2nd instance of the name.

How do I do that? Thanks in advance, and please have mercy on me. I really can't figure this out. I'm sure that, as usual, it's something simple and stupid I'm overlooking.

Dev Goddess
Developer / Analyst / Criminal Mastermind

"Size doesn't matter. It's all about speed and performance."

Re: How do I check a string for dupicate text?
by ikegami (Patriarch) on Sep 09, 2004 at 16:44 UTC

    1) If you expect the duplicated name to be the entire field:

    $len = length($field);
    $field1 = substr($field, 0, int($len/2));
    $field2 = substr($field, -int($len/2));
    $field = $field1 if ($field1 eq $field2);

    2) Handles spaces better:

    $field =~ s/^(.+)\s*\1/$1/;

    3) Handles duplicates anywhere in the field:

    $field =~ s/(.{2,})\s*\1/$1/g;

    Update: 4) Handles duplicates anywhere in the field, but stops on word boundaries:

    $field =~ s/\b(.+)\b\s*\1\b/$1/g;
    Test cases for all four follow
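The original test cases are not reproduced here. As a rough sketch only, the four variants above can be wrapped in subs and run side by side. The sub names and the two sample inputs are made up for illustration; the regexes and the substr logic are the ones listed above.

```perl
use strict;
use warnings;

# Variant 1: compare the two halves of the field (sub name is illustrative).
sub dedupe_halves {
    my ($field) = @_;
    my $half   = int(length($field) / 2);
    my $first  = substr($field, 0, $half);
    my $second = substr($field, -$half);
    return $first eq $second ? $first : $field;
}

# Variant 2: duplicate must start at the beginning of the field.
sub dedupe_anchored { my ($f) = @_; $f =~ s/^(.+)\s*\1/$1/;       return $f }

# Variant 3: duplicate of two or more characters anywhere in the field.
sub dedupe_anywhere { my ($f) = @_; $f =~ s/(.{2,})\s*\1/$1/g;    return $f }

# Variant 4: as 3, but the repeat has to sit on word boundaries.
sub dedupe_words    { my ($f) = @_; $f =~ s/\b(.+)\b\s*\1\b/$1/g; return $f }

for my $input ('John Smith John Smith', 'Jane Doe') {
    print "input: '$input'\n";
    print "  1) '", dedupe_halves($input),   "'\n";
    print "  2) '", dedupe_anchored($input), "'\n";
    print "  3) '", dedupe_anywhere($input), "'\n";
    print "  4) '", dedupe_words($input),    "'\n";
}
```

With these two inputs, all four variants should reduce 'John Smith John Smith' to 'John Smith' and leave 'Jane Doe' untouched.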
      s/(.{2,})\s*\1/$1/g

      Beware if you use this!

      Anybody called 'John Johnson' or 'Jo Jones' will lose their first name.

        Added (4) which fixes this up.
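To see the difference concretely, here is a small sketch that runs (3) and (4) on the two names mentioned in the warning. The sub names are hypothetical; the substitutions are copied from the reply above.

```perl
use strict;
use warnings;

# Variant 3: no word-boundary check, so a repeated prefix counts as a duplicate.
sub variant3 { my ($f) = @_; $f =~ s/(.{2,})\s*\1/$1/g;    return $f }

# Variant 4: the repeat must start and end on word boundaries.
sub variant4 { my ($f) = @_; $f =~ s/\b(.+)\b\s*\1\b/$1/g; return $f }

for my $name ('John Johnson', 'Jo Jones') {
    printf "%-12s  (3) => '%s'  (4) => '%s'\n",
        $name, variant3($name), variant4($name);
}
```

Variant (3) turns 'John Johnson' into 'Johnson' and 'Jo Jones' into 'Jones', because "John" and "Jo" are repeated as the start of the surname. Variant (4) leaves both names alone, since the second "John" or "Jo" does not end on a word boundary.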
      OMG! Holy crap! That last line worked perfectly. You're my new hero! *LOL*

      Dev Goddess

Re: How do I check a string for dupicate text?
by Anonymous Monk on Sep 10, 2004 at 09:46 UTC
    s/^         # Start of string
      (         # Start remembering
        \W*     # Leading non word characters
        (\w+)   # First word of string
        \s+     # Whitespace
        (\w+)   # Second word of string
        \b      # Make sure we got the entire word
        .*?     # Skip till second occurrence
      )         # Stop remembering
      \b        # Start at the beginning of a word
      \2        # Repeat of the first word
      \s+       # Whitespace
      \3        # Repeat of the second word
      \b        # And that's the end of the word
     /$1/sx;    # Just keep what we remembered
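A possible way to try this substitution out is sketched below. The sub name and the sample fields are made up, and the trailing-whitespace trim is an added assumption: the regex keeps everything up to the second occurrence, which includes the space in front of it.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the post above.
sub drop_repeated_name {
    my ($field) = @_;
    # Same pattern, written compactly; /x lets the spaces be ignored.
    $field =~ s/^ ( \W* (\w+) \s+ (\w+) \b .*? ) \b \2 \s+ \3 \b /$1/sx;
    # Assumption: the whitespace left in front of the removed name is unwanted.
    $field =~ s/\s+$//;
    return $field;
}

for my $field ('John Smith John Smith',
               'John Smith, Esq. John Smith',
               'John Johnson') {
    print "'$field' => '", drop_repeated_name($field), "'\n";
}
```

The first field becomes 'John Smith', the second becomes 'John Smith, Esq.', and 'John Johnson' is left as it is because the first two words never appear again as whole words.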