Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

#!/usr/bin/perl
while(<DATA>){
    s/\s(?=(<.+>|<.+\/>|<\/.+>))//g;
    print $_;
}
__DATA__
'' <p> Though Memphis <span> </span> fans may be Rodgers problems. </p><br/> Many ....
How do I remove the space between and after tags? The desired output is:
''<p>Though Memphis <span></span>fans may be Rodgers problems.</p><br/>Many ....

Re: space before and after
by graff (Chancellor) on Sep 03, 2009 at 08:13 UTC
    First you said "... remove the space between and after the tags." Then later on you said "Remove the spaces in front of tags and after the tags." Those are two different things, but neither of them is such a good idea, frankly.

    As ikegami said in his first reply, browsers collapse consecutive whitespace characters in HTML when rendering the text (outside of preformatted elements such as <pre>), so mucking with space characters in an HTML file is really unnecessary (from the point of view of someone reading the text in a browser).

    If you think about HTML tags for a little bit, you'll notice that some of them (like <p> <table> <blockquote> <br/> and so on) are designed to control how browsers apply whitespace when rendering HTML text (i.e. how they add spacing to enforce things like word separation, line breaks and indenting), while others (inline tags like <span> <a> <em> <input> and so on) have no impact on (do not add or control) spacing at all.

    So, a process that blindly removes space characters that are adjacent to all tags is very likely to cause some damage to the text (from the point of view of someone trying to read it in a browser), because for those inline tags (span, a, em, etc), the space(s) next to the tag might be the only basis for separating the two words that surround it.
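
    To make that damage concrete, here is a minimal sketch. The helper names strip_around_tags and visible_text are made up for illustration, and dropping the tags is only a crude stand-in for what a browser would display.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: blindly strip whitespace on both sides of every tag.
sub strip_around_tags {
    my ($html) = @_;
    $html =~ s/\s+(?=<)//g;     # whitespace immediately before a tag
    $html =~ s/(?<=>)\s+//g;    # whitespace immediately after a tag
    return $html;
}

# Crude stand-in for rendering: drop the tags and look at what text remains.
sub visible_text {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    return $html;
}

my $before = 'may be <span>Rodgers</span> problems.';
my $after  = strip_around_tags($before);

print visible_text($before), "\n";   # may be Rodgers problems.
print visible_text($after),  "\n";   # may beRodgersproblems.
```

    The spaces next to the <span> tags were the only thing keeping the three words apart.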

    If you think you have some other important reason for doing this (unrelated to what browsers normally do), it would help if you explained that. Depending on why you really want to do this, it's likely that you'll need to use one of the HTML parsing modules (e.g. HTML::Parser), and you'll need to be fairly careful about deciding which spaces to remove and which to keep.
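
    As a rough illustration of "deciding which spaces to remove", the sketch below trims whitespace only next to tags that force a line break anyway. It is not a recommendation: the tag list is an incomplete assumption, the sub names are invented, and the split() is a naive, dependency-free stand-in for a real tokenizer such as HTML::Parser, so it breaks on comments, CDATA, and ">" inside attribute values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: an illustrative, incomplete set of tags that force a line break
# when rendered, so whitespace next to them is not needed to separate words.
my %breaks_line = map { $_ => 1 } qw(p br div table tr td ul ol li blockquote);

# True if $token is an opening or closing tag from the set above.
sub is_breaking_tag {
    my ($token) = @_;
    return 0 unless defined $token && $token =~ m{^</?([A-Za-z][A-Za-z0-9]*)};
    return $breaks_line{ lc $1 } ? 1 : 0;
}

sub trim_around_breaking_tags {
    my ($html) = @_;
    # Naive tokenizer standing in for a real parser: alternating text and tags.
    my @tokens = split /(<[^>]*>)/, $html;
    for my $i (0 .. $#tokens) {
        next if $tokens[$i] =~ /^</;    # leave the tags themselves untouched
        $tokens[$i] =~ s/^\s+// if $i > 0        && is_breaking_tag($tokens[$i - 1]);
        $tokens[$i] =~ s/\s+$// if $i < $#tokens && is_breaking_tag($tokens[$i + 1]);
    }
    return join '', @tokens;
}

print trim_around_breaking_tags(
    '<p> Though Memphis <span> </span> fans may be Rodgers problems. </p><br/> Many'
), "\n";
# <p>Though Memphis <span> </span> fans may be Rodgers problems.</p><br/>Many
```

    The spaces around <span> survive, so the words on either side of it stay separated.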

    (updated to add a couple words that were missing)

Re: space before and after
by ikegami (Patriarch) on Sep 03, 2009 at 04:53 UTC

    Did you mean to keep the space before the <span> tag?

    HTML collapses multiple spaces into one, so what's the point? It'll only serve to introduce errors.

      Remove the spaces in front of tags and after the tags.

        So the output is wrong, then.

        s/\s+(?=<)//g; s/(?<=>)\s+//g;
        • Assumes no unescaped "<" and ">" in flow or other content.
        • Assumes no "<" and ">" in attribute values, comments or in CDATA sections.
        • Assumes no NET (null end tag) constructs, a poorly supported SGML feature.
        • Disregards the fact that this changes the HTML to something that's not equivalent.
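
        For reference, a small sketch of what those two substitutions do to the sample line from the question, plus one input that violates the second assumption. The wrapper name strip_tag_spaces and the <img> example are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The two substitutions from above, wrapped in a hypothetical helper.
sub strip_tag_spaces {
    my ($html) = @_;
    $html =~ s/\s+(?=<)//g;     # whitespace before any "<"
    $html =~ s/(?<=>)\s+//g;    # whitespace after any ">"
    return $html;
}

# Sample line from the question: the space before <span> is removed as well.
print strip_tag_spaces(
    q{'' <p> Though Memphis <span> </span> fans may be Rodgers problems. </p><br/> Many}
), "\n";
# ''<p>Though Memphis<span></span>fans may be Rodgers problems.</p><br/>Many

# A ">" inside an attribute value is mistaken for the end of a tag,
# so the attribute's content is silently altered.
print strip_tag_spaces(q{<img alt="a > b" src="x.png">}), "\n";
# <img alt="a >b" src="x.png">
```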