in reply to spaces removed in backreference

((?:<[^>]*>)*) can match the empty string (because of the * quantifier), so your original expression matches every string of one or more space characters.

When you add the . between the (?:<[^>]*>) and the *, the group becomes (?:<[^>]*>).* and so there must be a tag, which can then be followed by anything.

A better solution might be ((?:<[^>]*>)+)
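
To make the difference concrete, here is a minimal sketch. The sample string and both substitutions are assumptions for illustration (the original post's code is not quoted in this reply); only the two tag groups are taken from the discussion above.

```perl
use strict;
use warnings;

# Hypothetical sample input, not from the original post.
my $s = 'a  b <i> </i> c';

# With *, both tag groups may match the empty string,
# so every run of spaces matches and is removed.
(my $star = $s) =~ s/((?:<[^>]*>)*) +((?:<[^>]*>)*)/$1$2/g;

# With +, each side must contain at least one tag,
# so only spaces between tags are removed.
(my $plus = $s) =~ s/((?:<[^>]*>)+) +((?:<[^>]*>)+)/$1$2/g;

print "star: '$star'\n";   # star: 'ab<i></i>c'
print "plus: '$plus'\n";   # plus: 'a  b <i></i> c'
```

The * version strips the spaces in the plain text as well, which is the behaviour described above.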

Re^2: spaces removed in backreference
by AnomalousMonk (Archbishop) on Apr 25, 2010 at 04:43 UTC
    A better solution might be ((?:<[^>]*>)+)

    (?:<[^>]*>)+ allows a pair of tags with no intervening blanks to match as a unit, while plain (?:<[^>]*>) excludes such a pair from matching. Either way, open-tag/close-tag synchronization can be destroyed: the "pair" of tags between which blanks are eliminated is the wrong one. This can be remedied by changing the quantifier on $blank, but, as another reply has suggested, a proper HTML parser is really the best approach.

    >perl -wMstrict -le
    "my $s = '<foo a=b></foo> <bar c=d> </bar> ';
     my $t;
     ($t = $s) =~ s@((?:<[^>]*>)+) +((?:<[^>]*>)+)@$1$2@g;
     print qq{'$t'};
     my $tag   = qr{ < [^>]* > }xms;
     my $blank = qr{ [ ] }xms;
     ($t = $s) =~ s{ ($tag) $blank+ ($tag) }{$1$2}xmsg;
     print qq{'$t'};
     ($t = $s) =~ s{ ($tag) $blank* ($tag) }{$1$2}xmsg;
     print qq{'$t'};
     print qq{'$s'};
    "
    '<foo a=b></foo><bar c=d> </bar> '
    '<foo a=b></foo><bar c=d> </bar> '
    '<foo a=b></foo> <bar c=d></bar> '
    '<foo a=b></foo> <bar c=d> </bar> '

    Update: Added another example using the * quantifier.
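
A related sketch, not taken from the thread: because each match consumes both tags, the second tag of one match cannot serve as the first tag of the next. One possible way around this is to consume only the blanks and check the neighbouring characters with lookaround. The sample string is hypothetical, and the pattern assumes that any > and < delimit tags, which real HTML does not guarantee, so a parser remains the more robust choice.

```perl
use strict;
use warnings;

# Hypothetical sample input with three adjacent tags.
my $s = '<a> <b>  <c> text <d>';

# Consuming version: after matching '<a> <b>', the search resumes
# after '<b>', so the blanks between '<b>' and '<c>' are left alone.
(my $consumed = $s) =~ s/(<[^>]*>) +(<[^>]*>)/$1$2/g;

# Lookaround version: only the blanks are consumed, so every gap
# between a '>' and a '<' is handled independently.
(my $lookaround = $s) =~ s/(?<=>) +(?=<)//g;

print "'$consumed'\n";     # '<a><b>  <c> text <d>'
print "'$lookaround'\n";   # '<a><b><c> text <d>'
```

Blanks next to the plain text are untouched in both versions, since they are not bounded by > on the left and < on the right.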