Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Re: Removing duplicate substrings from a string - is a regex possible?

by arhuman (Vicar)
on Jul 23, 2001 at 17:46 UTC ( [id://99016]=note: print w/replies, xml ) Need Help??


in reply to Removing duplicate substrings from a string - is a regex possible?

Something like :
$data=~s/((.*)\/.*\/)\2/$1/g;
Could be a an approach.
But please give us more info about what you're manipulating for we can handle the (always present) special cases....

For a more precise answer, you should for example define :
How many times a string could be repeated (only once, more?),
what duplicates should be keept (first one, last one ?), if the duplicate can include a slash...

UPDATE : But as always all the useful info are in:
"Only Bad Coders Code Badly In Perl" (OBC2BIP)
  • Comment on Re: Removing duplicate substrings from a string - is a regex possible?
  • Download Code

Replies are listed 'Best First'.
Re: Re: Removing duplicate substrings from a string - is a regex possible?
by iakobski (Pilgrim) on Jul 23, 2001 at 18:09 UTC
    Hey, that's pretty good, arhuman. I will use that as a starting point...

    To be more precise:

    • There is a list of cities each separated by a slash
    • There might be some duplicate cities anywhere in the list (ie 0, 1 or many duplicates)
    • I want to form a list in any order without any duplicates, each separated by a slash (only one)
    Examples:
    • New York/Chicago/New York
    • New York/Chicago/New York/Chicago/Buffalo
    • Buffalo/New York/Chicago/New York/Chicago

    I read Mastering Regular Expressions a while ago, and I'll read Japhy's book when it comes out, but regular expressions sometimes mystify me. Is it easier for people with the name Jeffrey for some reason?

    -- iakobski

      I would strongly discourage regexes in this case!! It's a lot trickier than the split/hash solution (and a lot trickier than you think :). Take as an example the solution $data=~s!((.*)/.*/)\2!$1!g; from arhuman which doesn't work correctly even for your first example (NY/Ch/NY), as it leaves a '/' at the end of the string. And much more important it breaks in nearly all other cases. The problems include

      • greedyness of .*
      • correct removal of '/'
      • consider sth like 'abc/def/abcd' (not a real life example but a similar thing might come up) and 'abc' will be deleted
      • speed: the regex solution makes a comparison for pairs, the first word with all others, then the second word with all others. This results in an O(n^2) algorithm (i.e. double the number of elements in the string, quadruple the execution time) whereas the split/hash solution is O(n)

      -- Hofmator

      I think that with those new elements Hofmator is right,
      regexes aren't probably the good solution.

      Especially with other simple answers available like :
      print join'/',grep{if(!$u{$_}){$u{$_}=1}}split'/',$data;

      Note to hofmator: don't hit me too hard...
      My (poor) regex was not meant to be a solution but rather to show a possible way to study
      (wasn't approach bold enough ? ;-)
      Anyway I++ your post beccause your rightly outline some of the problems that may arise with the regex approach...
      and in particular in my poor regex approach


      "Only Bad Coders Code Badly In Perl" (OBC2BIP)
      /Jeff(?:rey)?/

      Well, I have two comments: a regex is a cool approach to take, and a regex is the wrong approach to take. You said yourself you could split and use a hash. That's what I'd do. I'd use a regex if I were trying to impress someone (which I'm not). But if you are, then here's my offering:

      $cities = "Here/There/Everywhere/There/Again/Here/Too"; $cities =~ s{ ([^/]+) # non-slashes (the city): \1 / # a / (?= # look ahead for:... (?: # group (not capture) [^/]* / # city and / )*? # 0 or more times, non-greedily \1 # is the city here again? (?: /|$ ) # a / or end-of-string ) } {}gx;

      _____________________________________________________
      Jeff japhy Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        Right argument, japhy!! But it's an interesting problem nevertheless. Your regex breaks on: New York/York/Boston => New York/Boston My fix makes it sadly less elegant:

        1 while s{ (/|^) # add: capture slash or start of string: $1 ([^/]+) # non-slashes (the city): \2 / # a / (?= # look ahead for:... (?: # group (not capture) [^/]* / # city and / )*? # 0 or more times, non-greedily \2 # is the city here again? (?: /|$ ) # a / or end-of-string ) } {$1}gx; # add: replacement
        Now this is close to my approach:  1 while s#(/|^)((?>[^/]*))/(?=(?:.*?/)?\2(?:/|$))#$1#g; The advantage of both solutions is that as little as possible is replaced. The disadvantage is that a regex isn't the right way to do it ;) - for all the reasons already given in this thread.

        -- Hofmator

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://99016]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (5)
As of 2024-04-26 08:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found