danj35 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I need to remove text between two unique points in a string to leave just text outside of these tags. I've made a regular expression to find and replace the text with nothing, but it doesn't seem to be working... Any ideas?

String: "Furthermore , expression of <GENE> Vpu </GENE> in Jurkat T cells rendered them more susceptible to <GENE> Fas </GENE> - induced death "

$sentence =~ s/\[<GENE>\]\s*(((?!\[<GENE>\]|\[<\/GENE>\]).)+)\s*\[<\/GENE>\]//gi;

I want it to leave, "Furthermore , expression of in Jurkat T cells rendered them more susceptible to - induced death", but it doesn't!

Cheers

Re: Remove text between two Start and End Tags (Regex)
by moritz (Cardinal) on Apr 19, 2011 at 15:20 UTC
      Works great, thanks. I haven't seen those brackets used for this type of problem before. I guess everything in the first set is the regex to find and the second set is the text to replace it with, with GI being greedy and case-insensitive. Learnt something new there... ;) Cheers
        Yeah, you don't have to use "/". You can use practically anything. When dealing with HTML and the like, it's more convenient to use something other than "/".

        The g does not mean greedy (although the match is greedy); it means global, i.e. find all of them.
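To make the delimiter and flag points concrete, here is a minimal sketch. The sample string and variable names are made up for illustration and are not from the thread; the pattern is deliberately simplified to a non-greedy `.*?`.

```perl
use strict;
use warnings;

# Illustrative sample text (an assumption, not from the original post).
my $with_slashes = 'a <GENE> x </GENE> b <gene> y </gene> c';
my $with_braces  = $with_slashes;

# "/" as the delimiter: every "/" inside the pattern has to be escaped.
$with_slashes =~ s/<GENE>.*?<\/GENE>//gi;

# "{}" as the delimiters: "/" needs no escaping.
$with_braces =~ s{<GENE>.*?</GENE>}{}gi;

# Without /g only the first match is replaced.
my $first_only = 'a <GENE> x </GENE> b <GENE> y </GENE> c';
$first_only =~ s{<GENE>.*?</GENE>}{};

print "$with_slashes\n$with_braces\n$first_only\n";
```

The first two substitutions should give the same result, since only the delimiter differs; /i is what lets the lowercase tags match, and /g is what makes both tagged spans go away.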

        Update: Show me to read quickly.

        --MidLifeXis

Re: Remove text between two Start and End Tags (Regex)
by toolic (Bishop) on Apr 19, 2011 at 15:20 UTC
    Get rid of all the square brackets from your regex because you do not have any in your input string:
    use warnings;
    use strict;

    my $sentence = "Furthermore , expression of <GENE> Vpu </GENE> in Jurkat T cells rendered them more susceptible to <GENE> Fas </GENE> - induced death ";
    $sentence =~ s/<GENE>\s*(((?!<GENE>|<\/GENE>).)+)\s*<\/GENE>//gi;
    print "$sentence\n";

    __END__
    Furthermore , expression of in Jurkat T cells rendered them more susceptible to - induced death

    Running your regex through YAPE::Regex::Explain highlighted the square brackets.
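As a rough illustration of why the brackets matter: `\[` and `\]` match literal square bracket characters, so the original pattern can only match text that actually contains them. The two sample strings below are shortened, hypothetical variants of the sentence in the question, and the pattern is simplified.

```perl
use strict;
use warnings;

# Shortened sample strings (assumptions for illustration).
my $plain     = 'of <GENE> Vpu </GENE> in';
my $bracketed = 'of [<GENE>] Vpu [</GENE>] in';

# \[ and \] match literal "[" and "]" characters.
my $pattern = qr/\[<GENE>\].*?\[<\/GENE>\]/;

my $plain_matches     = $plain     =~ $pattern ? 1 : 0;
my $bracketed_matches = $bracketed =~ $pattern ? 1 : 0;

print "plain: $plain_matches, bracketed: $bracketed_matches\n";
```

Only the bracketed string matches, which is why the substitution in the question left the sentence untouched.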

Re: Remove text between two Start and End Tags (Regex)
by ikegami (Patriarch) on Apr 19, 2011 at 15:19 UTC
    s{ <GENE> (?: (?! </GENE> ) . )* </GENE> \s* }{}xsg;
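For readers unfamiliar with the flags used above, here is the same substitution spread over several lines with comments, applied to the sentence from the question. The variable names are assumptions; /x lets the pattern contain whitespace and comments, /s lets `.` match newlines, and /g replaces every occurrence.

```perl
use strict;
use warnings;

my $sentence = "Furthermore , expression of <GENE> Vpu </GENE> in Jurkat T cells rendered them more susceptible to <GENE> Fas </GENE> - induced death ";

# Copy the sentence, then strip the tagged spans from the copy.
(my $cleaned = $sentence) =~ s{
    <GENE>                    # opening tag
    (?: (?! </GENE> ) . )*    # any character that does not begin a closing tag
    </GENE>                   # closing tag
    \s*                       # whitespace after the closing tag
}{}xsg;

print "$cleaned\n";
```

Because the trailing `\s*` consumes the space after each closing tag, this version should not leave a double space where a tag was removed. The trailing space at the end of the sentence is untouched.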
      Thanks
Re: Remove text between two Start and End Tags (Regex)
by patcat88 (Deacon) on Apr 20, 2011 at 00:43 UTC
    You don't need regular expressions to solve your problem. Using index and substr is an order of magnitude less CPU intensive. Make sure your XML tags are ALWAYS written the same way before choosing index: '<GENE>', '<Gene>' and '< GENE>' are totally different strings to index.
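    use strict;
    use warnings;

    my $string = 'Furthermore , expression of <GENE> Vpu </GENE> in Jurkat T cells rendered them more susceptible to <GENE> Fas </GENE> - induced death ';
    my $start = 0;
    while ((my $beg = index($string, '<GENE>', $start)) > -1) {
        # Look for the closing tag after the opening one; stop if it is missing.
        my $close = index($string, '</GENE>', $beg);
        last if $close == -1;
        my $end = $close + length('</GENE>');
        substr($string, $beg, $end - $beg, '');   # cut out the tagged span
        $start = $beg;   # the string just got shorter, so resume at the cut
    }
    print $string;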
    I do notice a double space left where the GENE tag used to be though.
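If the leftover double spaces are a problem, one possible cleanup is to collapse runs of spaces afterwards. This is only a sketch: the input below is what the sentence looks like after the tags are removed, and `$tidy` is a made-up name.

```perl
use strict;
use warnings;

# The sentence after tag removal, with double spaces where the tags were.
my $string = 'Furthermore , expression of  in Jurkat T cells rendered them more susceptible to  - induced death ';

(my $tidy = $string) =~ s/ {2,}/ /g;   # collapse runs of spaces to one
$tidy =~ s/\s+\z//;                    # drop trailing whitespace

print "$tidy\n";
```

This yields the sentence in the form the original question asked for.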