Removing duplicate substrings from a string

iakobski has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Removing duplicate substrings from a string - is a regex possible? by arhuman (Vicar) on Jul 23, 2001 at 17:46 UTC
Something like : `$data=~s/((.)\/.\/)\2/$1/g;` [download] Could be a an approach. But please give us more info about what you're manipulating for we can handle the (always present) special cases.... For a more precise answer, you should for example define : How many times a string could be repeated (only once, more?), what duplicates should be keept (first one, last one ?), if the duplicate can include a slash... UPDATE : But as always all the useful info are in: perldoc perlre Mastering regular expressions Japhy's (upcoming) book "Only Bad Coders Code Badly In Perl" (OBC2BIP)	[reply] [d/l]
Re: Re: Removing duplicate substrings from a string - is a regex possible? by iakobski (Pilgrim) on Jul 23, 2001 at 18:09 UTC
Hey, that's pretty good, arhuman. I will use that as a starting point... To be more precise: There is a list of cities each separated by a slash There might be some duplicate cities anywhere in the list (ie 0, 1 or many duplicates) I want to form a list in any order without any duplicates, each separated by a slash (only one) Examples: New York/Chicago/New York New York/Chicago/New York/Chicago/Buffalo Buffalo/New York/Chicago/New York/Chicago I read Mastering Regular Expressions a while ago, and I'll read Japhy's book when it comes out, but regular expressions sometimes mystify me. Is it easier for people with the name Jeffrey for some reason? -- iakobski	[reply]
Re: Re: Re: Removing duplicate substrings from a string - is a regex possible? by Hofmator (Curate) on Jul 23, 2001 at 18:40 UTC
I would strongly discourage regexes in this case!! It's a lot trickier than the split/hash solution (and a lot trickier than you think :). Take as an example the solution `$data=~s!((.)/./)\2!$1!g;` from arhuman which doesn't work correctly even for your first example (NY/Ch/NY), as it leaves a '/' at the end of the string. And much more important it breaks in nearly all other cases. The problems include greedyness of .* correct removal of '/' consider sth like 'abc/def/abcd' (not a real life example but a similar thing might come up) and 'abc' will be deleted speed: the regex solution makes a comparison for pairs, the first word with all others, then the second word with all others. This results in an O(n^2) algorithm (i.e. double the number of elements in the string, quadruple the execution time) whereas the split/hash solution is O(n) -- Hofmator	[reply] [d/l]
Re: Re: Re: Removing duplicate substrings from a string - is a regex possible? by arhuman (Vicar) on Jul 23, 2001 at 19:23 UTC
I think that with those new elements Hofmator is right, regexes aren't probably the good solution. Especially with other simple answers available like : `print join'/',grep{if(!$u{$_}){$u{$_}=1}}split'/',$data;` [download] Note to hofmator: don't hit me too hard... My (poor) regex was not meant to be a solution but rather to show a possible way to study (wasn't *approach* bold enough ? ;-) Anyway I++ your post beccause your rightly outline some of the problems that may arise with the regex approach... and in particular in my poor regex approach "Only Bad Coders Code Badly In Perl" (OBC2BIP)	[reply] [d/l]
Re: Re: Re: Removing duplicate substrings from a string - is a regex possible? by japhy (Canon) on Jul 24, 2001 at 00:04 UTC
`/Jeff(?:rey)?/` Well, I have two comments: a regex is a cool approach to take, and a regex is the wrong approach to take. You said yourself you could split and use a hash. That's what I'd do. I'd use a regex if I were trying to impress someone (which I'm not). But if you are, then here's my offering: `$cities = "Here/There/Everywhere/There/Again/Here/Too"; $cities =~ s{ ([^/]+) # non-slashes (the city): \1 / # a / (?= # look ahead for:... (?: # group (not capture) [^/]* / # city and / )? # 0 or more times, non-greedily \1 # is the city here again? (?: /\|$ ) # a / or end-of-string ) } {}gx;` [download] _____________________________________________________ Jeff `japhy` Pinyan: Perl, regex, and perl* hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l]
Re4: Removing duplicate substrings from a string - is a regex possible? by Hofmator (Curate) on Jul 24, 2001 at 13:36 UTC
Re: Removing duplicate substrings from a string - is a regex possible? by wine (Scribe) on Jul 23, 2001 at 18:26 UTC
OK, this is probably not the most beautiful and compact regex, but it works: `my $string = "New York/Chicago/New York/Chicago/Boston"; 1 while $string =~ s!(^\|/)([^/]+)(.)/\2(/\|$)!$1$2$3$4!g; print $string; #outputs "New York/Chicago/Boston"` [download] Update* Well of course this is an abuse of regex. I would not use it myself nor recommend it, but it's a nice exercise.	[reply] [d/l]
Re: Removing duplicate substrings from a string - is a regex possible? by thpfft (Chaplain) on Jul 23, 2001 at 19:36 UTC
Here's a hybrid solution. I'm inclined to agree that join(grep(split())) is going to be more robust, but this might be more economical: `$_ = '/New York/Chicago/New York/Boston/Houston/Chicago/Seattle/'; my %tally; s/\/([^\/]+)/($tally{$1}++) ? '' : "\/$1"/ge; print $_; __END__ prints /New York/Chicago/Boston/Houston/Washington/` [download] Note that this requires a leading / if it's to remove duplicates of the first item in the list. Split handles that sort of thing better...	[reply] [d/l]
Re: Removing duplicate substrings from a string - is a regex possible? by ginseng (Pilgrim) on Jul 23, 2001 at 19:38 UTC
I couldn't help but thinking 'sort \| uniq', so I did a search on uniq and came up with == uniq, which looks promising. I've read the comments warning against the regex solutions and have to agree, so you may wish to view MeowChow's array code for the same.	[reply]