Use of one of the fine CPAN XML parsing modules is almost certainly the best course. Perhaps some monk better versed than I in XML parsing can suggest appropriate choices. Novice monks often protest that these XML modules represent "too much code for my application" and want "just a simple" solution. This desire is usually a snare and a delusion: XML is complicated, and "simple" solutions are fragile and scale poorly.

However, if you are set on a simple solution, here are a couple of regex-based ones. Both handle strings containing embedded double-quotes and other punctuation. Again, both are inherently fragile. The second approach is both more specific about which tags are deleted and more tolerant of variations in tag case and whitespace.

Update: Changed following code example to be more Windose double-quote-friendly.

    >perl -wMstrict -le
    "my $s = '<a>foo</a><bc>bar</bc> <def>baz</def> \"x\" <ghij>%&*</ghij>';
     print qq{'$s'};
     ;;
     $s =~ s{ < ([^>]+) > ((?: (?! </ \1) .)*) </ \1 > }{$2}xmsg;
     print qq{'$s'};
     ;;
     ;;
     $s = '<B>foo</ b > <efg>bar</efg> \"stuff\" <cD >*&!</ Cd>';
     print qq{'$s'};
     ;;
     my @tags = qw(b cd);
     my $tag = join '|', @tags;
     $tag = qr{ (?i) $tag }xms;
     use re 'eval';
     $s =~ s{ < \s* ($tag) \s* >
              ((?: (?! </ \s* \1) .)*)
              </ \s* ([^>]*) (?(?{ lc($1) ne lc($^N) }) (*F)) \s* >
            }
            {$2}xmsg;
     print qq{'$s'};
    "
    '<a>foo</a><bc>bar</bc> <def>baz</def> "x" <ghij>%&*</ghij>'
    'foobar baz "x" %&*'
    '<B>foo</ b > <efg>bar</efg> "stuff" <cD >*&!</ Cd>'
    'foo <efg>bar</efg> "stuff" *&!'

Update: I just noticed the "and their content" requirement in the OPed title and output examples. Here's a two-pass regex solution (Update: Changed to make more modular, self-documenting):

    >perl -wMstrict -le
    "my $s = '<B>foo</ b > <EfG>bar</eFg> \"stuff\" <cD >*&!</ Cd> <x>baz</x>';
     print qq{'$s'};
     ;;
     my $ar_tag_delete_content = [ 1, tag_group_regex(qw(efg) ) ];
     my $ar_tag_leave_content  = [ 0, tag_group_regex(qw(b cd)) ];
     ;;
     for my $pass ($ar_tag_leave_content, $ar_tag_delete_content) {
         my ($delete_content, $tag) = @$pass;
         use re 'eval';
         $s =~ s{ < \s* ($tag) \s* >
                  ((?: (?! </ \s* \1) .)*)
                  </ \s* ([^>]*) (?(?{ lc($1) ne lc($^N) }) (*F)) \s* >
                }
                { $delete_content ? '' : $2 }xmsge;
         print qq{'$s'};
         }
     ;;
     sub tag_group_regex {
         my $alternation = join '|', @_;
         return qr{ (?i) $alternation }xms;
         }
    "
    '<B>foo</ b > <EfG>bar</eFg> "stuff" <cD >*&!</ Cd> <x>baz</x>'
    'foo <EfG>bar</eFg> "stuff" *&! <x>baz</x>'
    'foo "stuff" *&! <x>baz</x>'

Further Update:
Hey, wait a minute...
Does the foregoing even work?
Answer: No. Try it with the string '<b>foo</B> bar <b>baz</B>' and it falls over.
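As far as I can tell, the cause is that the \1 backreference in the look-ahead sits outside the (?i) scope of the tag regex, so it compares case-sensitively: with '<b>' opening and '</B>' closing, the look-ahead never sees a closing tag, and the content group runs on to the last '</B>' in the string. Here is a minimal sketch that puts the earlier substitution next to one using a case-insensitive backreference. The sub names strip_tags_old and strip_tags_new are made up for illustration; the regexes are the ones from the one-liners.

```perl
use strict;
use warnings;

# Case-insensitive tag name, built the same way as in the one-liners.
my $tag = qr{ (?i) b }xms;

# Earlier version (hypothetical name): \1 in the look-ahead is case-sensitive,
# so a closing tag in a different case does not stop the content group.
sub strip_tags_old {
    my ($s) = @_;
    use re 'eval';
    $s =~ s{ < \s* ($tag) \s* >
             ((?: (?! </ \s* \1) .)*)
             </ \s* ([^>]*) (?(?{ lc($1) ne lc($^N) }) (*F)) \s* >
           }
           {$2}xmsg;
    return $s;
}

# Later version (hypothetical name): lazy content plus a case-insensitive \1.
sub strip_tags_new {
    my ($s) = @_;
    $s =~ s{ < \s* ($tag) \s* > (.*?) </ \s* (?i) \1 \s* > }{$2}xmsg;
    return $s;
}

my $in = '<b>foo</B> bar <b>baz</B>';
print "old: '", strip_tags_old($in), "'\n";   # inner tags should survive here
print "new: '", strip_tags_new($in), "'\n";   # new: 'foo bar baz'
```

When the opening and closing tags agree in case, both versions should give the same result; the difference only shows up with mixed case and more than one tag pair in the string.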

The following works better, is simpler, and also gets rid of the quite unnecessary (?(?{ lc($1) ne lc($^N) }) (*F)) business. (But this is still quite naive and fragile code for processing XML!)

    >perl -wMstrict -le
    "my @strings = (
       '<B>foo</ b > <EfG>bar</eFg> \"stuff\" <cD >*&!</ Cd> <x>baz</x>',
       '<b>fee</B> P <b>fie</B> Q <efg>foe</EFG> R <efg>fum</EFG> S',
       '<b>hee</b> W <b>hie</b> X <efg>hoe</efg> Y <efg>hum</efg> Z',
       );
     ;;
     my $ar_keep_tag_content = [ 1, tag_group_regex(qw(b cd)) ];
     my $ar_kill_tag_content = [ 0, tag_group_regex(qw(efg) ) ];
     ;;
     for my $s (@strings) {
         print qq{'$s'};
         for my $pass ($ar_keep_tag_content, $ar_kill_tag_content) {
             my ($keep_content, $tag) = @$pass;
             $s =~ s{ < \s* ($tag) \s* > (.*?) </ \s* (?i) \1 \s* > }
                    { $keep_content ? $2 : '' }xmsge;
             print qq{'$s'};
             }
         print '';
         }
     ;;
     sub tag_group_regex {
         my $alternation = join '|', @_;
         return qr{ (?i) $alternation }xms;
         }
    "
    '<B>foo</ b > <EfG>bar</eFg> "stuff" <cD >*&!</ Cd> <x>baz</x>'
    'foo <EfG>bar</eFg> "stuff" *&! <x>baz</x>'
    'foo "stuff" *&! <x>baz</x>'

    '<b>fee</B> P <b>fie</B> Q <efg>foe</EFG> R <efg>fum</EFG> S'
    'fee P fie Q <efg>foe</EFG> R <efg>fum</EFG> S'
    'fee P fie Q R S'

    '<b>hee</b> W <b>hie</b> X <efg>hoe</efg> Y <efg>hum</efg> Z'
    'hee W hie X <efg>hoe</efg> Y <efg>hum</efg> Z'
    'hee W hie X Y Z'
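To make the "naive and fragile" caveat concrete, here is a small sketch of inputs on which the last substitution gives surprising results. The sub name strip_tags and the test strings are my own, chosen for illustration; only the regex comes from the code above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the final substitution: removes the named
# tags but keeps their content.
sub strip_tags {
    my ($s, @tags) = @_;
    my $alternation = join '|', @tags;
    my $tag = qr{ (?i) $alternation }xms;
    $s =~ s{ < \s* ($tag) \s* > (.*?) </ \s* (?i) \1 \s* > }{$2}xmsg;
    return $s;
}

my @cases = (
    '<b>plain</b>',                     # simple pair: works
    '<b>outer <b>inner</b> tail</b>',   # nested same-name tags: mismatched pair left behind
    '<b class="x">attr</b>',            # attribute on the tag: not matched at all
    '<!-- <b>x</b> -->',                # inside a comment: altered anyway
);

for my $case (@cases) {
    printf "%-34s => %s\n", "'$case'", "'" . strip_tags($case, 'b') . "'";
}
```

The nested case comes out as 'outer <b>inner tail</b>' because the lazy match pairs the first opening tag with the first closing tag. The attribute case is left untouched, and the comment case is rewritten even though a real XML parser would leave comment text alone.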

In reply to Re: remove xml tag and their content by AnomalousMonk
in thread remove xml tag and their content by zac_carl
