iakobski has asked for the wisdom of the Perl Monks concerning the following question:

I have a string that looks something like this:

New York/Chicago/New York

and I want to remove the duplicates. I could split, insert into a hash and then reform the string, but is there a better way? I thought there might be a regex for this, but it eludes me. I'm going to try join-grep-split, but if anyone has any bright ideas I'd be interested.

-- iakobski

  • Comment on Removing duplicate substrings from a string - is a regex possible?

Replies are listed 'Best First'.
Re: Removing duplicate substrings from a string - is a regex possible?
by arhuman (Vicar) on Jul 23, 2001 at 17:46 UTC
    Something like :
    $data=~s/((.*)\/.*\/)\2/$1/g;
    Could be a an approach.
    But please give us more info about what you're manipulating for we can handle the (always present) special cases....

    For a more precise answer, you should for example define :
    How many times a string could be repeated (only once, more?),
    what duplicates should be keept (first one, last one ?), if the duplicate can include a slash...

    UPDATE : But as always all the useful info are in:
    "Only Bad Coders Code Badly In Perl" (OBC2BIP)
      Hey, that's pretty good, arhuman. I will use that as a starting point...

      To be more precise:

      • There is a list of cities each separated by a slash
      • There might be some duplicate cities anywhere in the list (ie 0, 1 or many duplicates)
      • I want to form a list in any order without any duplicates, each separated by a slash (only one)
      Examples:
      • New York/Chicago/New York
      • New York/Chicago/New York/Chicago/Buffalo
      • Buffalo/New York/Chicago/New York/Chicago

      I read Mastering Regular Expressions a while ago, and I'll read Japhy's book when it comes out, but regular expressions sometimes mystify me. Is it easier for people with the name Jeffrey for some reason?

      -- iakobski

        I would strongly discourage regexes in this case!! It's a lot trickier than the split/hash solution (and a lot trickier than you think :). Take as an example the solution $data=~s!((.*)/.*/)\2!$1!g; from arhuman which doesn't work correctly even for your first example (NY/Ch/NY), as it leaves a '/' at the end of the string. And much more important it breaks in nearly all other cases. The problems include

        • greedyness of .*
        • correct removal of '/'
        • consider sth like 'abc/def/abcd' (not a real life example but a similar thing might come up) and 'abc' will be deleted
        • speed: the regex solution makes a comparison for pairs, the first word with all others, then the second word with all others. This results in an O(n^2) algorithm (i.e. double the number of elements in the string, quadruple the execution time) whereas the split/hash solution is O(n)

        -- Hofmator

        I think that with those new elements Hofmator is right,
        regexes aren't probably the good solution.

        Especially with other simple answers available like :
        print join'/',grep{if(!$u{$_}){$u{$_}=1}}split'/',$data;

        Note to hofmator: don't hit me too hard...
        My (poor) regex was not meant to be a solution but rather to show a possible way to study
        (wasn't approach bold enough ? ;-)
        Anyway I++ your post beccause your rightly outline some of the problems that may arise with the regex approach...
        and in particular in my poor regex approach


        "Only Bad Coders Code Badly In Perl" (OBC2BIP)
        /Jeff(?:rey)?/

        Well, I have two comments: a regex is a cool approach to take, and a regex is the wrong approach to take. You said yourself you could split and use a hash. That's what I'd do. I'd use a regex if I were trying to impress someone (which I'm not). But if you are, then here's my offering:

        $cities = "Here/There/Everywhere/There/Again/Here/Too"; $cities =~ s{ ([^/]+) # non-slashes (the city): \1 / # a / (?= # look ahead for:... (?: # group (not capture) [^/]* / # city and / )*? # 0 or more times, non-greedily \1 # is the city here again? (?: /|$ ) # a / or end-of-string ) } {}gx;

        _____________________________________________________
        Jeff japhy Pinyan: Perl, regex, and perl hacker.
        s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: Removing duplicate substrings from a string - is a regex possible?
by wine (Scribe) on Jul 23, 2001 at 18:26 UTC

    OK, this is probably not the most beautiful and compact regex, but it works:

    my $string = "New York/Chicago/New York/Chicago/Boston"; 1 while $string =~ s!(^|/)([^/]+)(.*)/\2(/|$)!$1$2$3$4!g; print $string; #outputs "New York/Chicago/Boston"

    Update Well of course this is an abuse of regex. I would not use it myself nor recommend it, but it's a nice exercise.

Re: Removing duplicate substrings from a string - is a regex possible?
by thpfft (Chaplain) on Jul 23, 2001 at 19:36 UTC

    Here's a hybrid solution. I'm inclined to agree that join(grep(split())) is going to be more robust, but this might be more economical:

    $_ = '/New York/Chicago/New York/Boston/Houston/Chicago/Seattle/'; my %tally; s/\/([^\/]+)/($tally{$1}++) ? '' : "\/$1"/ge; print $_; __END__ prints /New York/Chicago/Boston/Houston/Washington/

    Note that this requires a leading / if it's to remove duplicates of the first item in the list. Split handles that sort of thing better...

Re: Removing duplicate substrings from a string - is a regex possible?
by ginseng (Pilgrim) on Jul 23, 2001 at 19:38 UTC

    I couldn't help but thinking 'sort | uniq', so I did a search on uniq and came up with == uniq, which looks promising. I've read the comments warning against the regex solutions and have to agree, so you may wish to view MeowChow's array code for the same.