zulqernain has asked for the wisdom of the Perl Monks concerning the following question:

I am using
s!<[/?|\\?]*?[a-z][a-z0-9]*[^<>]*>! !g;
to replace the XML tags with a space, but it's not working. What might be the problem?

Replies are listed 'Best First'.
Re: removing tags
by mirod (Canon) on May 27, 2005 at 11:54 UTC
    perl -MXML::Parser -e'XML::Parser->new( Handlers => { Char => sub { print $_[1]; }})->parse(\*STDIN)'

    or, of course

    perl -MXML::Twig -e'print XML::Twig->new->parse(\*STDIN)->root->text'
Re: removing tags
by castaway (Parson) on May 27, 2005 at 11:34 UTC
    Your regular expression is being too greedy. The ".*" bit will get everything up until the *last* ">" in the line, which will probably eliminate the content as well. Try using:
    s/<.*?>|<\/.*?>/ /g;
    instead, which makes it stop at the first ">" character found.

    C.
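
To illustrate the greediness castaway describes, here is a minimal sketch comparing ".*" with ".*?" on a single line. The sample string and the variable names are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical one-line sample with two elements.
my $xml = '<a>one</a> <b>two</b>';

# Greedy: ".*" runs on to the LAST ">" in the line, so the text
# between the tags is swallowed along with the tags.
(my $greedy = $xml) =~ s/<.*>/ /g;

# Non-greedy: ".*?" stops at the FIRST ">", so only the tags
# themselves are replaced and the text survives.
(my $lazy = $xml) =~ s/<.*?>/ /g;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

Note that "<.*?>" already matches closing tags such as "</a>", so the second alternative in the suggested substitution should not be needed.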

Re: removing tags
by muntfish (Chaplain) on May 27, 2005 at 11:38 UTC

    Try a Super Search for "remove XML tags" or similar.

    This node may be useful. In general, consider using a module such as XML::Simple rather than a regex for parsing XML (or HTML for that matter). It'll be a lot more reliable and therefore less painful...


    s^^unp(;75N=&9I<V@`ack(u,^;s|\(.+\`|"$`$'\"$&\"\)"|ee;/m.+h/&&print$&
Re: removing tags
by bart (Canon) on May 27, 2005 at 11:41 UTC
    You search for a backslash in the second alternative; that should be a forward slash. Anyway, the latter alternative is useless, as the former one will match it too. And as castaway said, you're suffering from regexp greediness.

    Your remaining problems are:

    1. A tag can contain "<" or ">" characters in quoted attributes.
    2. A tag can contain a newline. Why read line by line anyway?

    To solve the first problem, you could try the following:

    s/<(?>[^>'"]+|"[^"]*"|'[^']*')*>//g;
    Try it with
    $_ = '<foo attr="heh>hoh">bar</foo>';

    As for the second problem, processing the whole file in one go while adding the /s modifier would probably help.
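
As a rough sketch of both points, the snippet below wraps the substitution above in a sub, compares it with a naive pattern on the sample string, and then applies it to input read in one go. The names strip_tags and strip_tags_naive, the multi-line sample, and the in-memory filehandle are assumptions made for illustration. This is still not a real XML parser: comments and CDATA sections, for instance, are not handled.

```perl
use strict;
use warnings;

# Remove tags, treating each quoted attribute value as one unit so
# that a ">" inside quotes does not end the tag early.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<(?>[^>'"]+|"[^"]*"|'[^']*')*>//g;
    return $text;
}

# Naive version for comparison: ends the tag at the first ">" it sees.
sub strip_tags_naive {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return $text;
}

my $sample = '<foo attr="heh>hoh">bar</foo>';
print "quote-aware: ", strip_tags($sample), "\n";
print "naive:       ", strip_tags_naive($sample), "\n";

# Hypothetical input whose opening tag spans two lines. An in-memory
# filehandle stands in for a real file here.
my $doc = "<foo\n  attr='a>b'>bar</foo>\n";
open my $fh, '<', \$doc or die "cannot open in-memory file: $!";
my $whole = do { local $/; <$fh> };   # undefined $/ reads everything at once
close $fh;

print "slurped:     ", strip_tags($whole);
```

The negated character classes in this pattern match newlines on their own, so /s is not strictly required here; it matters for patterns that rely on ".".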

Re: removing tags
by artist (Parson) on May 27, 2005 at 12:33 UTC
    Use XML modules rather than regular expressions for XML files. They are a lot better, and they know the XML format in depth. There are many modules to begin with, so use the advice given in the other nodes.
    --Artist
Re: removing tags
by gopalr (Priest) on May 27, 2005 at 15:31 UTC
Re: removing tags
by graff (Chancellor) on May 29, 2005 at 02:21 UTC
    I like mirod's solution best. Translating that into something that looks more like the way you would normally create a perl script:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Parser;

    my %handlers = ( Char => \&print_text );
    my $parser = new XML::Parser( Handlers => \%handlers );

    # if a file name was given on the command line
    # open it as STDIN:
    if ( @ARGV == 1) {
        open( STDIN, $ARGV[0] ) or die "$ARGV[0]: $!";
    }
    $parser->parse( \*STDIN );

    sub print_text {
        # this is a callback invoked by XML::Parser with 2 params:
        # 1st is the parser object, 2nd is text
        print $_[1];
    }