monkie has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm looking for a single line Regular Expression solution to the following:

I need to remove all <br> tags between "href='" and the following "'" from a string.

The solution that I have come up with is:

#------------------------
$value = "<a href='\\192.161.254.00\share\Testing\Company ABC\Discovery <br>Forms\587 <br>Read and Write Gold V8.0.doc' target='_blank' >File on <br>Cshare</a>";

while ($value =~ m/(href='.*?)<br>(.*?')/gsi) {
    $value =~ s/(href='.*?)<br>(.*?')/$1$2/gsi;
}
#------------------------

This seems to be hunky-dory, but I'm sure that there is a single-line regular expression solution, don't you think?

Replies are listed 'Best First'.
Re: Regular Expression Elegance
by massa (Hermit) on Nov 11, 2008 at 13:10 UTC
    $value =~ s{(href=')([^']*)(')}{$1.join('',split/<br>/,$2).$3}gsie;
    should work perfectly.

    Update: changed regex delims.

    []s, HTH, Massa (κς,πμ,πλ)
      .. and it does work perfectly.

      It doesn't suffer from the problem that JavaFan pointed out in my original either.

      Perfect, now that's Regular Expression Elegance!
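
As a quick check of that claim, here is a minimal sketch that wraps massa's substitution in a sub and runs it on a shortened sample plus JavaFan's counterexample. The sub name `strip_br_in_href` and the sample strings are illustrative, not from the thread.

```perl
use strict;
use warnings;

# massa's substitution, unchanged, wrapped in a sub (name is illustrative).
sub strip_br_in_href {
    my ($value) = @_;
    # $1 = "href='", $2 = everything up to the next quote, $3 = the quote.
    # [^']* cannot cross the closing quote, so only the attribute is touched.
    $value =~ s{(href=')([^']*)(')}{$1.join('',split/<br>/,$2).$3}gsie;
    return $value;
}

# <br> inside the href is removed, <br> in the link text is kept.
print strip_br_in_href("<a href='a<br>b<br>c'>x<br>y</a>"), "\n";

# JavaFan's counterexample: the <br> between the two links must survive.
print strip_br_in_href("<a href='foo'>bar</a>bla<br>bla<a href='bar'>foo</a>"), "\n";
```

One detail worth noting: the inner `split/<br>/` is case-sensitive even though the outer match uses /i, so an upper-case `<BR>` inside the href would be left alone. Writing `split/<br>/i` would cover that case.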

Re: Regular Expression Elegance
by JadeNB (Chaplain) on Nov 11, 2008 at 15:32 UTC
    This seems to be hunky-dory, but I'm sure that there is a single-line regular expression solution, don't you think?
    I've been here long enough to know that the proper answer to "What regex allows me to manipulate this markup language in the way I want?" is "Don't use regexes to manipulate markup; use dedicated parsers", and yet I see that far wiser monks than I have given different answers. On reflection, that's probably because your $value is like no HTML on earth. Where is this coming from? (It's irrelevant to the solution, but I'm curious.)
      you're opening a can of worms here . . . legacy code is the short answer!

      user input is saved from a multi-line textbox to a db. To display the text, carriage returns are handled by this:
      $value =~ s/\n/<br>/g;
      recently code has been added to allow hyperlinks to be input by users, using custom tags handled by another regular expression. If a typed link is longer than the multi-line textbox width, it will wrap and, thanks to the above regular expression, will contain a <br>, which breaks the link.
      now you know . . .
        I guess that it's not possible to get at the $value before the <br>-substitution—maybe making a copy of it, so that you don't stomp on the display manipulations—and perform the substitution $value =~ s/\n//g on it instead?
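
A minimal sketch of that ordering, under loud assumptions: the thread never shows the custom link tags or where the href is produced, so this sketch simply assumes the raw text already contains `href='...'` before the newline substitution runs. The sub name `render_text` is hypothetical, and `tr///r` needs Perl 5.14 or later.

```perl
use 5.014;
use warnings;

# Hypothetical display routine: clean the links first, then add <br>.
sub render_text {
    my ($raw) = @_;

    # Step 1 (assumption: links already appear as href='...' at this point):
    # delete newlines inside the attribute value only.
    $raw =~ s{(href=')([^']*)(')}{ $1 . ($2 =~ tr/\n//dr) . $3 }ge;

    # Step 2: the existing display rule from the thread.
    $raw =~ s/\n/<br>/g;

    return $raw;
}

print render_text("see <a href='http://x/a\nb'>link</a>\nnext"), "\n";
```

With this order no <br> ever lands inside the link, so nothing has to be removed afterwards.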
Re: Regular Expression Elegance
by JavaFan (Canon) on Nov 11, 2008 at 13:26 UTC
    This seems to be hunky-dory
    But it isn't. If you start with $value being
    "<a href='foo'>bar</a>bla<br>bla<a href='bar'>foo</a>";
    you end with $value being
    "<a href='foo'>bar</a>blabla<a href='bar'>foo</a>";
    You're bitten by the notion that /PAT1'(.*?)'PAT2/ will not end up with a ' in $1.
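
The point can be reproduced with a minimal sketch. The first sub is the loop from the question; the second swaps `.*?` for `[^']*` so the match cannot run past the closing quote. Both sub names are illustrative, and the second sub is just one possible bounded variant, not code from the thread.

```perl
use strict;
use warnings;

# The loop from the question: .*? is lazy, but it may still cross a quote
# if that is the only way to reach a <br>.
sub strip_br_lazy {
    my ($value) = @_;
    while ($value =~ m/(href='.*?)<br>(.*?')/gsi) {
        $value =~ s/(href='.*?)<br>(.*?')/$1$2/gsi;
    }
    return $value;
}

# Bounded variant: [^']* stops at the closing quote of the attribute.
sub strip_br_bounded {
    my ($value) = @_;
    1 while $value =~ s/(href='[^']*)<br>/$1/gi;
    return $value;
}

my $input = "<a href='foo'>bar</a>bla<br>bla<a href='bar'>foo</a>";

print strip_br_lazy($input), "\n";
# prints: <a href='foo'>bar</a>blabla<a href='bar'>foo</a>

print strip_br_bounded($input), "\n";
# prints the input unchanged
```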

    I'd go with:

    while ($value =~ /href='\K([^']*<br>[^']*)(?=')/p) {
        my ($pre, $post) = (${^PREMATCH}, ${^POSTMATCH});
        my $href = $1;
        $href =~ s/<br>//g;
        $value = "$pre$href$post";
    }
      I see. Well spotted! That was some quick thinking.
Re: Regular Expression Elegance
by gone2015 (Deacon) on Nov 11, 2008 at 15:59 UTC

    As others have pointed out, the key here is [^'] to keep inside the '....'.

    This is a little shorter:

    while ($value =~ s/href='[^']*?\K<br>//gsi) {} ;
    but requires n+1 passes across $value, where n is the maximum number of <br> between href=' and the closing '.

    Or, following massa, above:

    $value =~ s{href='[^']*?\K<br>([^']*)}{local $_ = $1, s/<br>//g, $_}gsie;
    which does one pass only, and only processes any href='...' which contains one or more <br>. (A (?=') before the first } will stop it processing an unterminated href='...', should you care.)
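
A sketch of the one-pass form with the (?=') lookahead added, to show what the lookahead changes. The sub name `strip_br_one_pass` is illustrative, and the replacement is written with `s///r` (Perl 5.14 or later) instead of `local $_`, which is a substitution of mine rather than the code above.

```perl
use 5.014;
use warnings;

sub strip_br_one_pass {
    my ($value) = @_;
    # \K keeps "href='..." up to the first <br> out of the replaced text.
    # $1 is the rest of the attribute value; (?=') insists that a closing
    # quote follows, so an unterminated href='... is left untouched.
    $value =~ s{href='[^']*?\K<br>([^']*)(?=')}{ $1 =~ s/<br>//gir }gsie;
    return $value;
}

# Terminated attribute: both <br> inside the href are removed.
print strip_br_one_pass("<a href='a<br>b<br>c' t>x<br>y</a>"), "\n";

# Unterminated attribute: no closing quote, so nothing is changed.
print strip_br_one_pass("<a href='a<br>b"), "\n";
```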

    And that fills my quota of regex finagling for the day.

      I love it!