Allasso has asked for the wisdom of the Perl Monks concerning the following question:

I have this routine that takes out all the spaces between a string of adjacent html tags. The only problem is that it also takes the spaces out of the tags themselves. I cannot understand why this is happening.

eg: <span class="c"> becomes <spanclass="c">

$newtoken = ""; while ("$token" ne "$newtoken") { $newtoken = "$token"; $token =~ s@((?:<[^>]*>)*) +((?:<[^>]*>)*)@\1\2@g; }

If I put a "dot" metacharacter as below, it fixes it, but I don't understand why. If I am back-referencing the whole tag, why would anything inside the back-reference be affected?

$newtoken = ""; while ("$token" ne "$newtoken") { $newtoken = "$token"; $token =~ s@((?:<[^>]*>).*) +((?:<[^>]*>).*)@\1\2@g; }

Replies are listed 'Best First'.
Re: spaces removed in backreference
by graff (Chancellor) on Apr 25, 2010 at 05:10 UTC
    There is probably an easy way to do this with a real parser (and it's not easy to imagine a situation where it makes sense to worry about spaces between html tags). But putting all that aside, I wonder why you wouldn't just do something like this, using look-behind and look-ahead assertions:
    s/(?<=>) +(?=<)//g;
    Or like this, using captures (possibly a little bit "less efficient", but not enough to worry about it):
    s/(>) +(<)/$1$2/g;
    (No need to "loop until done" -- the "g" modifier takes care of the whole string.)
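
    A minimal sketch of the two substitutions above, run against a made-up sample string ($html and its contents are assumptions for illustration, not taken from the original post). Both variants should finish in a single pass and leave the space inside the span tag alone:

```perl
use strict;
use warnings;

# Hypothetical sample input, invented for this illustration.
my $html = '<p> <span class="c">some text</span>   <b>bold</b> </p>';

# Lookaround version: only the spaces themselves are part of the match.
(my $lookaround = $html) =~ s/(?<=>) +(?=<)//g;

# Capture version: the angle brackets are matched and then put back.
(my $captured = $html) =~ s/(>) +(<)/$1$2/g;

print "$lookaround\n";
print "$captured\n";
```

    Spaces are removed only where the character before them is ">" and the character after them is "<", so the space in <span class="c"> is never a candidate.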

    I also wonder why you use backslashes instead of dollar signs for your captures in the replacement part. If you had use warnings; in your code, you would have been told: "\1 better written as $1..." (and I suspect there's a good reason why, but it escapes me at the moment).
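
    One reason usually given for the warning (it is discussed in perlre under "Warning on \1 Instead of $1") is that the replacement side of s/// is a double-quoted string, so \1 only works there as a special case and cannot be delimited when a digit follows it. A small sketch with an invented string ($price is an assumption for illustration):

```perl
use strict;
use warnings;

# Hypothetical input, invented for this illustration.
my $price = 'cost: 5';

# ${1} delimits the capture number, so literal digits can follow it.
# There is no equivalent way to write this with \1.
(my $scaled = $price) =~ s/(\d+)/${1}000/;

print "$scaled\n";
```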

      s/(?<=>) +(?=<)//g; s/(>) +(<)/$1$2/g;

      Don't know why I never thought of these...

      Thanks for the tip on use warnings

Re: spaces removed in backreference
by ig (Vicar) on Apr 25, 2010 at 03:10 UTC

    ((?:<[^>]*>)*) can match nothing (because of the * quantifier), so your original expression matches every string of one or more space characters.

    When you add the . between the (?:<[^>]*>) and the *, then there must be a tag, which can be followed by anything.

    A better solution might be ((?:<[^>]*>)+)
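
    A rough sketch of the difference, using an invented $token (the sample string is an assumption, not from the original post). With * both groups are allowed to be empty, so the bare space inside the tag matches; with + a tag is required on each side:

```perl
use strict;
use warnings;

# Hypothetical sample input, invented for this illustration.
my $token = '<span class="c">text</span> <b>bold</b>';

# Original pattern: both groups may match the empty string.
(my $star = $token) =~ s@((?:<[^>]*>)*) +((?:<[^>]*>)*)@$1$2@g;

# With + each group must contain at least one tag.
(my $plus = $token) =~ s@((?:<[^>]*>)+) +((?:<[^>]*>)+)@$1$2@g;

print "$star\n";
print "$plus\n";
```

    In the first result the space inside the span tag is gone as well; in the second only the space between </span> and <b> is removed.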

      A better solution might be  ((?:<[^>]*>)+)

      (?:<[^>]*>)+ or even just plain (?:<[^>]*>) allows a pair of tags with no intervening blanks to match or to be excluded from matching, respectively, thereby potentially destroying open-tag/close-tag synchronization: in either case, the "pair" of tags between which blanks are eliminated is incorrect. This can be remedied by changing the quantifier on $blank, but, as another reply has suggested, a proper HTML parser is really the best approach.

      >perl -wMstrict -le
      "my $s = '<foo a=b></foo> <bar c=d> </bar> ';
       my $t;
       ($t = $s) =~ s@((?:<[^>]*>)+) +((?:<[^>]*>)+)@$1$2@g;
       print qq{'$t'};
       my $tag = qr{ < [^>]* > }xms;
       my $blank = qr{ [ ] }xms;
       ($t = $s) =~ s{ ($tag) $blank+ ($tag) }{$1$2}xmsg;
       print qq{'$t'};
       ($t = $s) =~ s{ ($tag) $blank* ($tag) }{$1$2}xmsg;
       print qq{'$t'};
       print qq{'$s'};
      "
      '<foo a=b></foo><bar c=d> </bar> '
      '<foo a=b></foo><bar c=d> </bar> '
      '<foo a=b></foo> <bar c=d></bar> '
      '<foo a=b></foo> <bar c=d> </bar> '

      Update: Added another example using the * quantifier.

Re: spaces removed in backreference
by Anonymous Monk on Apr 25, 2010 at 03:16 UTC
    • You have a zero-length match (in $1), followed by space, followed by a zero-length match ($2)
    • use re 'debug'; to see exactly what your regex matches
    • use a real HTML parser, like HTML::TreeBuilder or YAPE::HTML
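
    A small sketch of the first point, assuming a made-up $token: it records where the original pattern matches and what the two groups captured. Adding use re 'debug'; at the top would additionally print the regex engine's own trace.

```perl
use strict;
use warnings;

# Hypothetical sample input, invented for this illustration.
my $token = '<span class="c">';
my ($offset, $first, $second);

if ($token =~ m@((?:<[^>]*>)*) +((?:<[^>]*>)*)@) {
    # $-[0] is the offset at which the whole match begins.
    ($offset, $first, $second) = ($-[0], $1, $2);
    print "match starts at offset $offset\n";
    print "first group:  '$first'\n";
    print "second group: '$second'\n";
}
```

    Both groups come back empty, and the match begins at the space inside the tag rather than at the start of the string.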
Re: spaces removed in backreference
by Allasso (Monk) on Apr 25, 2010 at 14:47 UTC

    FWIW, I am actually not parsing HTML, I am generating it from RTF docs. I wanted a simple rtf2html converter that didn't mark up the page with every possible style anyone can think of (which every converter I have tried does), and just give the basic p, blockquote, i, b, u, and colored text spans. It does 90% of the tedious work, and is much easier than trying to fix the scary html that the other converters spit out.

    (ps, I like my html pages to look good and readable, and I don't like spaces where they aren't needed.)

Re: spaces removed in backreference
by Allasso (Monk) on Apr 25, 2010 at 09:46 UTC

    Thanks for all the input, really helpful information. Very much appreciated. This is mainly a learning experience for me, which is why I hadn't thought about a "real" parser. Good suggestion though, it would probably be helpful.