in reply to A regex that does this, but not that?

It's not clear what you want. Do you want to remove all words that aren't "test" or numbers? Do you want to remove the words "thought", "tot" and "tesset"? Do you want to remove all words, except the 2nd, 4th, 5th and 6th? Do you want to remove all words that start and end with a "t", but don't have "es" (and nothing else) between them?

Being able to properly formulate what you want a regex to do solves 90% of the problem. Stating your problem by a single example just leaves people guessing.

Abigail

Re: Re: A regex that does this, but not that?
by bradcathey (Prior) on Nov 15, 2003 at 01:36 UTC
    Abigail-II, I spent quite a bit of time trying to craft my example carefully, so that if there was a regex solution to return the result I specified, I'd have my answer. pg got it perfectly. But just in case you're still interested—I know you're one of the regex gurus around the monastery, and I have always appreciated your thoroughness:

    1. I want to delete any words that start with "t", end with "t", but do not contain any other "t"s within, except for the word "test".
    2. The result should only be the words: "test" and any other non t*t words. "1 2 3" was just an example.
    3. The order of words, the number of words, or the content of any other words not "t\w+t", should not be a factor.

    I'd still love to hear your thoughts as I am trying to really ramp up my coding skills. Thanks.

    —Brad
    "A little yeast leavens the whole dough."

      Well, pg's solution works for the limited input provided, and you haven't given any further particulars about the input. That solution breaks if you just change the first word from "thought" to "though":

      my $var = "though test tot 1 2 3 tesset";
      $var =~ s/(t.*?t)/($1 ne "test") ? "" : $1/ge;
      print $var;
      # prints: esoesset

      But, now you mention a further constraint that the words to be deleted may not contain any 't's inside, which is not inferrable from your earlier posts at all. Providing a good specification is much more than providing a sample case (but providing test cases *is* important).

      Anyway, here's a go at your new specs:

      my $var = <<TT;
      target blah foo test thought 123 though tempest
      testament though tightest treatment thermostat tantamount taboo
      TT

      $var =~ s/(?!\btest\b)(\bt[^t\W]*t\b)//g;
      print $var;

      __END__
      ## Result:
      blah foo test 123 though
      testament though tightest treatment thermostat tantamount taboo

      So, all the 't.*t' words on the second line remain because they contain a 't' character within. All the 't.*t' words on the first line get deleted except for 'test'.

      any words that start with "t", end with "t", but do not contain any other "t"s within

      OK so that's \bt[^t]+t\b -- word-boundary, then a t, then one or more other characters not a t, then a t, then a word boundary.

      Apart from the abbreviation "tt" this should be fine.

      So "tent", "tesseract", "tot", "tort" and "test" itself will match this pattern.

      However, "testament" will fail it because of the "t" in the middle.

      Then you need a special case for "test" itself, which you can do with the /e modifier and the ternary operator, as in pg's example above.

      So something like this:

      #!/usr/bin/perl -w
      use strict;

      my $words = 'test Buffy testament Anya tot Willow tesseract Faith tent';
      $words =~ s/\b(t[^t]+t)\b/$1 eq "test" ? $1 : ''/ge;
      print $words;
      # prints 'test Buffy testament Anya Willow Faith'
      # (the spaces around each deleted word are left behind)

      Where the regex means "Find words matching t, something-not-t, then t at the end. Replace them with nothing, unless they're the word test, in which case, replace them with themselves".

      You could replace the ternary thing with this more longwinded version if you liked:

      $words =~ s/\b(t[^t]+t)\b/
          my $temp = $1;
          if ($temp eq 'test') { $temp } else { '' }
      /xge;


      ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print

        Your character class [^t] can itself cross word boundaries, so a string like "this will be a problem right?" will be a problem, right?
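A minimal sketch of that failure, assuming the substitution exactly as posted above. The sub names strip_loose and strip_tight are made up for illustration. Swapping [^t] for [^t\W] is one way to keep a match inside a single word, since \W covers spaces and punctuation.

```perl
use strict;
use warnings;

# Hypothetical helper: the pattern as posted. [^t] matches anything that
# is not a "t", including spaces, so one match can span several words.
sub strip_loose {
    my ($s) = @_;
    $s =~ s/\b(t[^t]+t)\b/$1 eq 'test' ? $1 : ''/ge;
    return $s;
}

# Hypothetical helper: [^t\W] matches only word characters other than "t",
# so a match can no longer run from one word into the next.
sub strip_tight {
    my ($s) = @_;
    $s =~ s/\b(t[^t\W]+t)\b/$1 eq 'test' ? $1 : ''/ge;
    return $s;
}

my $text = 'this will be a problem right?';
print strip_loose($text), "\n";   # only the "?" survives
print strip_tight($text), "\n";   # the sentence comes back unchanged
```

With the loose class, everything from the "t" of "this" to the "t" of "right" counts as one match and is deleted. With the tight class, neither "this" nor "right" qualifies, so nothing is removed.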

      I spent quite a bit of time trying to craft my example carefully, so that if there was a regex solution to return the result I specified, I'd have my answer.

      But the problem is that you left it at the example. I could have given you a couple of regexes that solved your example, but would probably have failed to do what you wanted on the second example you tried.

      pg got it perfectly.
      Then you and he got lucky. If he had come up with a different regexp that solved your one example but did something else on other sentences, he would have wasted time formulating a useless answer. However, is it really true that pg's answer got it right? Your requirements say:
      I want to delete any words that start with "t", end with "t", but do not contain any other "t"s within, except for the word "test".
      and pg's regex is:
      s/(t.*?t)/($1 ne "test") ? "" : $1/ge;
      Now, to me that regex just deletes strings starting with a t, and ending with the next t, with the exception of the word "test". So, let's try it on another example:
      $_ = "this is the wristwatch";
      s/(t.*?t)/($1 ne "test") ? "" : $1/ge;
      print;
      __END__
      he wrisch
      Now, that might be exactly what you had in mind, but it doesn't suit the requirements.

      Abigail

      s/\s*\bt(?!est)[^t\W]*t\b//g;
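For anyone who wants to try that one-liner, here is a small sketch that runs it over a few of the sentences used earlier in this thread. The wrapper name drop_t_words is hypothetical, and the third input line is made up from words that appear above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution above.
sub drop_t_words {
    my ($s) = @_;
    # \s*        also swallow the whitespace in front of a deleted word
    # \bt        word starts with "t"
    # (?!est)    ... but is not followed by "est" (spares "test")
    # [^t\W]*    word characters other than "t"
    # t\b        word ends with "t"
    $s =~ s/\s*\bt(?!est)[^t\W]*t\b//g;
    return $s;
}

for my $line ('thought test tot 1 2 3 tesset',
              'this is the wristwatch',
              'tent testament tightest') {
    # Brackets make leftover leading or trailing spaces visible.
    printf "%-32s => [%s]\n", $line, drop_t_words($line);
}
```

The first line comes back as "test 1 2 3" with one leading space, left where "thought" stood, because \s* only eats whitespace before a deleted word. The second line is untouched. The (?!est) lookahead is not anchored with \b, but that looks harmless here: any longer word starting with "test" has an inner "t" and is already excluded by [^t\W]*.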

      Makeshifts last the longest.