Re: Re: Removing duplicate substrings from a string

Hey, that's pretty good, arhuman. I will use that as a starting point...

To be more precise:

There is a list of cities each separated by a slash
There might be some duplicate cities anywhere in the list (ie 0, 1 or many duplicates)
I want to form a list in any order without any duplicates, each separated by a slash (only one)

Examples:

New York/Chicago/New York
New York/Chicago/New York/Chicago/Buffalo
Buffalo/New York/Chicago/New York/Chicago

I read Mastering Regular Expressions a while ago, and I'll read Japhy's book when it comes out, but regular expressions sometimes mystify me. Is it easier for people with the name Jeffrey for some reason?

-- iakobski

Comment on Re: Re: Removing duplicate substrings from a string - is a regex possible?

Replies are listed 'Best First'.
Re: Re: Re: Removing duplicate substrings from a string - is a regex possible? by Hofmator (Curate) on Jul 23, 2001 at 18:40 UTC
I would strongly discourage regexes in this case!! It's a lot trickier than the split/hash solution (and a lot trickier than you think :). Take as an example the solution `$data=~s!((.)/./)\2!$1!g;` from arhuman which doesn't work correctly even for your first example (NY/Ch/NY), as it leaves a '/' at the end of the string. And much more important it breaks in nearly all other cases. The problems include greedyness of .* correct removal of '/' consider sth like 'abc/def/abcd' (not a real life example but a similar thing might come up) and 'abc' will be deleted speed: the regex solution makes a comparison for pairs, the first word with all others, then the second word with all others. This results in an O(n^2) algorithm (i.e. double the number of elements in the string, quadruple the execution time) whereas the split/hash solution is O(n) -- Hofmator	[reply] [d/l]
Re: Re: Re: Removing duplicate substrings from a string - is a regex possible? by arhuman (Vicar) on Jul 23, 2001 at 19:23 UTC
I think that with those new elements Hofmator is right, regexes aren't probably the good solution. Especially with other simple answers available like : `print join'/',grep{if(!$u{$_}){$u{$_}=1}}split'/',$data;` [download] Note to hofmator: don't hit me too hard... My (poor) regex was not meant to be a solution but rather to show a possible way to study (wasn't *approach* bold enough ? ;-) Anyway I++ your post beccause your rightly outline some of the problems that may arise with the regex approach... and in particular in my poor regex approach "Only Bad Coders Code Badly In Perl" (OBC2BIP)	[reply] [d/l]
Re: Re: Re: Removing duplicate substrings from a string - is a regex possible? by japhy (Canon) on Jul 24, 2001 at 00:04 UTC
`/Jeff(?:rey)?/` Well, I have two comments: a regex is a cool approach to take, and a regex is the wrong approach to take. You said yourself you could split and use a hash. That's what I'd do. I'd use a regex if I were trying to impress someone (which I'm not). But if you are, then here's my offering: `$cities = "Here/There/Everywhere/There/Again/Here/Too"; $cities =~ s{ ([^/]+) # non-slashes (the city): \1 / # a / (?= # look ahead for:... (?: # group (not capture) [^/]* / # city and / )? # 0 or more times, non-greedily \1 # is the city here again? (?: /\|$ ) # a / or end-of-string ) } {}gx;` [download] _____________________________________________________ Jeff `japhy` Pinyan: Perl, regex, and perl* hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l]
Re4: Removing duplicate substrings from a string - is a regex possible? by Hofmator (Curate) on Jul 24, 2001 at 13:36 UTC
Right argument, japhy!! But it's an interesting problem nevertheless. Your regex breaks on: `New York/York/Boston => New York/Boston` My fix makes it sadly less elegant: `1 while s{ (/\|^) # add: capture slash or start of string: $1 ([^/]+) # non-slashes (the city): \2 / # a / (?= # look ahead for:... (?: # group (not capture) [^/]* / # city and / )? # 0 or more times, non-greedily \2 # is the city here again? (?: /\|$ ) # a / or end-of-string ) } {$1}gx; # add: replacement` [download] Now this is close to my approach: `1 while s#(/\|^)((?>[^/]))/(?=(?:.*?/)?\2(?:/\|$))#$1#g;` The advantage of both solutions is that as little as possible is replaced. The disadvantage is that a regex isn't the right way to do it ;) - for all the reasons already given in this thread. -- Hofmator	[reply] [d/l] [select]