in reply to Re: Removing duplicate substrings from a string - is a regex possible?
in thread Removing duplicate substrings from a string - is a regex possible?

Hey, that's pretty good, arhuman. I will use that as a starting point...

To be more precise:

Examples:

I read Mastering Regular Expressions a while ago, and I'll read Japhy's book when it comes out, but regular expressions sometimes mystify me. Is it easier for people with the name Jeffrey for some reason?

-- iakobski

  • Comment on Re: Re: Removing duplicate substrings from a string - is a regex possible?

Replies are listed 'Best First'.
Re: Re: Re: Removing duplicate substrings from a string - is a regex possible?
by Hofmator (Curate) on Jul 23, 2001 at 18:40 UTC

    I would strongly discourage regexes in this case!! It's a lot trickier than the split/hash solution (and a lot trickier than you think :). Take as an example the solution $data=~s!((.*)/.*/)\2!$1!g; from arhuman which doesn't work correctly even for your first example (NY/Ch/NY), as it leaves a '/' at the end of the string. And much more important it breaks in nearly all other cases. The problems include

    • greedyness of .*
    • correct removal of '/'
    • consider sth like 'abc/def/abcd' (not a real life example but a similar thing might come up) and 'abc' will be deleted
    • speed: the regex solution makes a comparison for pairs, the first word with all others, then the second word with all others. This results in an O(n^2) algorithm (i.e. double the number of elements in the string, quadruple the execution time) whereas the split/hash solution is O(n)

    -- Hofmator

Re: Re: Re: Removing duplicate substrings from a string - is a regex possible?
by arhuman (Vicar) on Jul 23, 2001 at 19:23 UTC
    I think that with those new elements Hofmator is right,
    regexes aren't probably the good solution.

    Especially with other simple answers available like :
    print join'/',grep{if(!$u{$_}){$u{$_}=1}}split'/',$data;

    Note to hofmator: don't hit me too hard...
    My (poor) regex was not meant to be a solution but rather to show a possible way to study
    (wasn't approach bold enough ? ;-)
    Anyway I++ your post beccause your rightly outline some of the problems that may arise with the regex approach...
    and in particular in my poor regex approach


    "Only Bad Coders Code Badly In Perl" (OBC2BIP)
Re: Re: Re: Removing duplicate substrings from a string - is a regex possible?
by japhy (Canon) on Jul 24, 2001 at 00:04 UTC
    /Jeff(?:rey)?/

    Well, I have two comments: a regex is a cool approach to take, and a regex is the wrong approach to take. You said yourself you could split and use a hash. That's what I'd do. I'd use a regex if I were trying to impress someone (which I'm not). But if you are, then here's my offering:

    $cities = "Here/There/Everywhere/There/Again/Here/Too"; $cities =~ s{ ([^/]+) # non-slashes (the city): \1 / # a / (?= # look ahead for:... (?: # group (not capture) [^/]* / # city and / )*? # 0 or more times, non-greedily \1 # is the city here again? (?: /|$ ) # a / or end-of-string ) } {}gx;

    _____________________________________________________
    Jeff japhy Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      Right argument, japhy!! But it's an interesting problem nevertheless. Your regex breaks on: New York/York/Boston => New York/Boston My fix makes it sadly less elegant:

      1 while s{ (/|^) # add: capture slash or start of string: $1 ([^/]+) # non-slashes (the city): \2 / # a / (?= # look ahead for:... (?: # group (not capture) [^/]* / # city and / )*? # 0 or more times, non-greedily \2 # is the city here again? (?: /|$ ) # a / or end-of-string ) } {$1}gx; # add: replacement
      Now this is close to my approach:  1 while s#(/|^)((?>[^/]*))/(?=(?:.*?/)?\2(?:/|$))#$1#g; The advantage of both solutions is that as little as possible is replaced. The disadvantage is that a regex isn't the right way to do it ;) - for all the reasons already given in this thread.

      -- Hofmator