Remove text between two Start and End Tags (Regex)

danj35 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I need to remove text between two unique points in a string to leave just text outside of these tags. I've made a regular expression to find and replace the text with nothing, but it doesn't seem to be working... Any ideas?

String: "Furthermore , expression of <GENE> Vpu </GENE> in Jurkat T cells rendered them more susceptible to <GENE> Fas </GENE> - induced death "

$sentence =~ s/\[<GENE>\]\s*(((?!\[<GENE>\]|\[<\/GENE>\]).)+)\s*\[<\/GENE>\]//gi;

I want it to leave, "Furthermore , expression of in Jurkat T cells rendered them more susceptible to - induced death", but it doesn't!

Cheers


Re: Remove text between two Start and End Tags (Regex) by moritz (Cardinal) on Apr 19, 2011 at 15:20 UTC
Much simpler: `$sentence =~ s{<GENE>.*?</GENE>}{}gsi;` (Update: Added the s modifier, suggested by wind++.)
Re^2: Remove text between two Start and End Tags (Regex) by danj35 (Sexton) on Apr 19, 2011 at 15:25 UTC
Works great, thanks. I haven't seen those brackets used for this type of problem before. I guess everything in the first set is the regex to find and the second set is the text to replace it with. GI being greedy and case-insensitive. Learnt something new there... ;) Cheers
Re^3: Remove text between two Start and End Tags (Regex) by ikegami (Patriarch) on Apr 19, 2011 at 16:08 UTC
Yeah, you don't have to use "/". You can use practically anything. When dealing with HTML and the like, it's more convenient to use something other than "/".
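To illustrate the point about delimiters, here is a minimal sketch. The sample string and variable names are made up for the example; it only shows that the same substitution can be written with "/" (escaping the slash in the closing tag) or with braces (no escaping).

```perl
use strict;
use warnings;

# Made-up sample input for the illustration.
my $input = 'a <GENE> Vpu </GENE> b';

# With "/" as the delimiter, the slash in </GENE> has to be escaped.
(my $with_slashes = $input) =~ s/<GENE>.*?<\/GENE>//gsi;

# With braces as delimiters, the closing tag can be written as-is.
(my $with_braces = $input) =~ s{<GENE>.*?</GENE>}{}gsi;

print "same result\n" if $with_slashes eq $with_braces;
print "[$with_braces]\n";
```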
Re^3: Remove text between two Start and End Tags (Regex) by MidLifeXis (Monsignor) on Apr 19, 2011 at 15:30 UTC
Not greedy ~~(although it is)~~, global (meaning find all of them). Update: Shows me to read quickly. --MidLifeXis
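As a rough sketch of what the pieces do: the `?` after `.*` is what makes the match non-greedy, `g` repeats the substitution for every match, `i` ignores case, and `s` lets `.` match a newline. The sample text and variable names below are invented for the illustration.

```perl
use strict;
use warnings;

# Invented sample: two tagged spans, the second in lowercase and spanning a newline.
my $text = "x <GENE> A </GENE> y <gene> B\n</gene> z";

# Greedy .* runs from the first opening tag to the last closing tag.
(my $greedy = $text) =~ s{<GENE>.*</GENE>}{}gsi;

# Non-greedy .*? stops at the nearest closing tag; /g repeats for every pair.
(my $lazy = $text) =~ s{<GENE>.*?</GENE>}{}gsi;

# Without /g only the first pair is removed.
(my $first_only = $text) =~ s{<GENE>.*?</GENE>}{}si;

print "greedy:     [$greedy]\n";
print "lazy:       [$lazy]\n";
print "first only: [$first_only]\n";
```

The greedy version also swallows the " y " between the two spans, which is why the non-greedy quantifier matters here.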
Re^4: Remove text between two Start and End Tags (Regex) by danj35 (Sexton) on Apr 19, 2011 at 16:07 UTC
Re^5: Remove text between two Start and End Tags (Regex) by choroba (Cardinal) on Apr 19, 2011 at 21:26 UTC
Re: Remove text between two Start and End Tags (Regex) by toolic (Bishop) on Apr 19, 2011 at 15:20 UTC
Get rid of all the square brackets from your regex because you do not have any in your input string: `use warnings; use strict; my $sentence = "Furthermore , expression of <GENE> Vpu </GENE> in Jurkat T cells rendered them more susceptible to <GENE> Fas </GENE> - induced death "; $sentence =~ s/<GENE>\s(((?!<GENE>|<\/GENE>).)+)\s<\/GENE>//gi; print "$sentence\n";` This prints: `Furthermore , expression of  in Jurkat T cells rendered them more susceptible to  - induced death` Running your regex through YAPE::Regex::Explain highlighted the square brackets.
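A small sketch of why the brackets get in the way, using a shortened version of the sentence (the variable names are just for the example): escaped brackets match literal "[" and "]" characters, and unescaped brackets form a character class that matches a single character.

```perl
use strict;
use warnings;

my $sentence = 'expression of <GENE> Vpu </GENE> in Jurkat';

# \[<GENE>\] looks for the literal text "[<GENE>]", brackets included.
my $escaped = ($sentence =~ /\[<GENE>\]/) ? 'match' : 'no match';

# Unescaped, [<GENE>] is a character class: exactly one of < G E N >.
my ($one_char) = $sentence =~ /([<GENE>])/;

# The plain tag is what actually appears in the input.
my $plain = ($sentence =~ /<GENE>/) ? 'match' : 'no match';

print "escaped brackets: $escaped\n";
print "character class matched: '$one_char'\n";
print "plain tag: $plain\n";
```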
Re: Remove text between two Start and End Tags (Regex) by ikegami (Patriarch) on Apr 19, 2011 at 15:19 UTC
`s{ <GENE> (?: (?! </GENE> ) . )* </GENE> \s* }{}xsg;`
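For readers new to the `x` modifier, the same pattern can be spread over several lines with comments. This is only an annotated sketch applied to the sentence from the question; the comments are my reading of the pattern, and `$cleaned` is a name chosen for the example.

```perl
use strict;
use warnings;

my $sentence = "Furthermore , expression of <GENE> Vpu </GENE> in Jurkat T cells "
             . "rendered them more susceptible to <GENE> Fas </GENE> - induced death ";

(my $cleaned = $sentence) =~ s{
    <GENE>                  # opening tag
    (?: (?! </GENE> ) . )*  # any character that does not begin a closing tag
    </GENE>                 # closing tag
    \s*                     # whitespace following the closing tag
}{}xsg;

print "[$cleaned]\n";
```

Because the trailing `\s*` also consumes the space after each closing tag, this version does not leave a double space where a tag was removed.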
Re^2: Remove text between two Start and End Tags (Regex) by danj35 (Sexton) on Apr 19, 2011 at 15:22 UTC
Thanks
Re: Remove text between two Start and End Tags (Regex) by patcat88 (Deacon) on Apr 20, 2011 at 00:43 UTC
You don't need regular expressions to solve your problem. It is an order of magnitude less CPU-intensive to use index and substr. Make sure your XML tags are ALWAYS the same before choosing to use index: '<GENE>', '<Gene>' and '< GENE>' are totally different to index. `$string = 'Furthermore , expression of <GENE> Vpu </GENE> in Jurkat T cells rendered them more susceptible to <GENE> Fas </GENE> - induced death '; $start = 0; while (($beg = index($string, '<GENE>', $start)) > -1) { $end = index($string, '</GENE>', $beg); last if $end < 0; $end += 7; substr($string, $beg, $end - $beg, ''); $start = $beg; } print $string;` I do notice a double space left where each GENE tag used to be, though.
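One possible way to package the index/substr approach and deal with the leftover double space is sketched below. `strip_tagged` is a hypothetical helper name, the sample string is shortened, and squeezing spaces with `tr` is my assumption about how one might tidy up, not something from the reply above.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every $open ... $close span using index/substr.
sub strip_tagged {
    my ($string, $open, $close) = @_;
    my $start = 0;
    while ((my $beg = index($string, $open, $start)) > -1) {
        my $end = index($string, $close, $beg + length $open);
        last if $end < 0;    # unmatched opening tag: leave the rest alone
        substr($string, $beg, $end + length($close) - $beg, '');
        $start = $beg;       # the string shrank, so resume at the cut point
    }
    return $string;
}

my $out = strip_tagged('of <GENE> Vpu </GENE> in <GENE> Fas </GENE> - induced',
                       '<GENE>', '</GENE>');
$out =~ tr/ //s;    # squeeze each run of spaces down to a single space
print "[$out]\n";
```

Note that `tr/ //s` squeezes every run of spaces in the string, not only the ones created by the removal, so it is only appropriate when that is acceptable for the data.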