dan2bit has asked for the wisdom of the Perl Monks concerning the following question:

looking for a single regexp to do
s/(\s+)/\ /g
UNLESS $1 is inside an HTML tag <(.*)$1(.*)>. In other words, I want to substitute all white space outside of HTML tags in a (single-line) string.

so far it's making my brain hurt

Replies are listed 'Best First'.
Re: looking for a regexp
by swiftone (Curate) on Jun 08, 2000 at 19:23 UTC
    Look into HTML::Parser. HTML is very hard to match with a regexp because you can have nested < >. merlyn's WebTechnique columns have many HTML::Parser examples, as did the latest issue of The Perl Journal.

    That said, a basic Regexp to match simple HTML is:

    /<[^>]*>/ #matches an HTML tag
    #So you would want:
    # s/(<[^>]*>[^<\s]*)\s+/$1\&nbsp;/g
    #Should match a tag followed by some non-tag, non whitespace, followed
    #by whitespace. Untested.
    You will also have to match any whitespace before the first tag, but you can probably handle that.
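One way to sidestep the "whitespace before the first tag" problem is to split the line on tags and only touch the pieces in between. The following is a minimal sketch, not something posted in this thread: the sub name `nbsp_outside_tags` is made up for illustration, and it assumes "simple" HTML in the sense used above (no `>` inside attribute values, no comments or scripts).

```perl
use strict;
use warnings;

# Hypothetical helper: replace runs of whitespace with &nbsp; in the text
# between simple tags. Assumes no '>' inside attribute values.
sub nbsp_outside_tags {
    my ($line) = @_;

    # The capturing group makes split return the tags as well as the
    # text between them, in their original order.
    my @pieces = split /(<[^>]*>)/, $line;

    for my $piece (@pieces) {
        next if $piece =~ /\A<[^>]*>\z/;   # a tag: leave it untouched
        $piece =~ s/\s+/&nbsp;/g;          # text outside tags
    }
    return join '', @pieces;
}

print nbsp_outside_tags('a  b <a href="x">c d</a> e'), "\n";
# prints: a&nbsp;b&nbsp;<a href="x">c&nbsp;d</a>&nbsp;e
```

Whitespace inside the `<a href="x">` tag is preserved because that piece is skipped whole, and text before the first tag is handled like any other text piece.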
      And swiftone was known to speak:
      That said, a basic Regexp to match simple HTML is:
      /<[^>]*>/ #matches an HTML tag
      Uh, no. This incorrectly stops on
      <hello there="inside > foo">
      too early. Please use HTML::Filter or one of the other HTML::Parser-derived modules.

      -- Randal L. Schwartz, Perl hacker
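To make the objection concrete, here is a small sketch showing where the simple pattern stops on that tag. The second, quote-aware pattern is an illustration added here, not something proposed in this thread; it copes with quoted attribute values but is still not a parser (comments, scripts and malformed markup will trip it up).

```perl
use strict;
use warnings;

my $tag = '<hello there="inside > foo">';

# The simple pattern stops at the first '>', which sits inside the
# quoted attribute value.
my ($simple) = $tag =~ /(<[^>]*>)/;
print "$simple\n";    # prints: <hello there="inside >

# Quote-aware variant (an assumption for illustration): quoted strings
# are consumed as a unit, so a '>' inside them does not end the tag.
my $quoted_tag = qr/<(?:[^>"']|"[^"]*"|'[^']*')*>/;
my ($aware) = $tag =~ /($quoted_tag)/;
print "$aware\n";     # prints: <hello there="inside > foo">
```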

        Uh, no. This incorrectly stops on
        <hello there="inside > foo">

        That's not _simple_ HTML anymore. :) Packages are better (as I suggested), but sometimes you need a quick script for a simple task and you don't want to have to learn a package to do it.

Re: looking for a regexp
by cwest (Friar) on Jun 08, 2000 at 20:20 UTC
    This will do all you ask.
    s/\s([^<]*<[^>]*>|$)/&nbsp;$1/g;
    
    --
    Casey
    
      That is, unless you ask it to do-- <B>This</B> Line <I>fails</I>. This is quickly becoming a FAQ. Hopefully it'll be in the new Q&A section.

      See Substitution outside of HTML TAGS or other threads turned up by searching for 'html tags'.
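For anyone wondering why that line fails, this short sketch runs the substitution from the parent post on it. The variable names are chosen here for illustration.

```perl
use strict;
use warnings;

my $line = '<B>This</B> Line <I>fails</I>.';

# The substitution from the parent post, applied to a copy of the line.
(my $result = $line) =~ s/\s([^<]*<[^>]*>|$)/&nbsp;$1/g;

print "$result\n";
# prints: <B>This</B>&nbsp;Line <I>fails</I>.
```

The space before `<I>` survives because `[^<]*` swallows it into `$1` along with "Line", so it is never seen by `\s` again. The same pattern can also fire on whitespace inside a tag when another tag follows it.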

RE: looking for a regexp
by dan2bit (Acolyte) on Jun 08, 2000 at 19:23 UTC
    there should be an &nbsp; in the second part of the substitution (failed to read the "escape characters" help file - D'oh!)

    {::} =D------an (sometimes the cord just doesn't quite reach)