Regular Expression Elegance

monkie has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm looking for a single-line regular expression solution to the following:

I need to remove all <br> tags between "href='" and the following "'" in a string.

The solution that I have come up with is:

#------------------------
$value = "<a href='\\192.161.254.00\share\Testing\Company ABC\Discovery <br>Forms\587 <br>Read and Write Gold V8.0.doc' target='_blank' >File on <br>Cshare</a>";


while ($value =~ m/(href='.*?)<br>(.*?')/gsi)
        {
            $value =~ s/(href='.*?)<br>(.*?')/$1$2/gsi;
        }
#------------------------

This seems to be hunky-dory, but I'm sure that there is a single-line regular expression solution, don't you?


Re: Regular Expression Elegance by massa (Hermit) on Nov 11, 2008 at 13:10 UTC
`$value =~ s{(href=')([^']*)(')}{$1.join('',split/<br>/,$2).$3}gsie;` should work perfectly. Update: changed regex delims. []s, HTH, Massa (κς,πμ,πλ)
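For anyone who wants to try this, here is a minimal sketch that wraps the substitution above in a sub and runs it on a shortened link. The sub name `strip_href_breaks` and the sample string are made up for illustration; only the regex comes from the reply.

```perl
use strict;
use warnings;

# strip_href_breaks is a hypothetical helper name, not part of the thread.
# It removes every <br> between href=' and the next single quote.
sub strip_href_breaks {
    my ($html) = @_;
    # $2 holds the quoted attribute value; split/join drops each <br> in it.
    # [^']* cannot cross a quote, so text outside the attribute is untouched.
    $html =~ s{(href=')([^']*)(')}{$1 . join('', split /<br>/, $2) . $3}gsie;
    return $html;
}

# Shortened, invented sample input with breaks inside and outside the href.
my $link = q{<a href='doc <br>one <br>two.doc' target='_blank'>File on <br>Cshare</a>};
print strip_href_breaks($link), "\n";
# prints: <a href='doc one two.doc' target='_blank'>File on <br>Cshare</a>
```

Note that the inner `split /<br>/` is case-sensitive; the outer `/i` flag applies only to the `href='` pattern.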
Re^2: Regular Expression Elegance by monkie (Novice) on Nov 11, 2008 at 14:31 UTC
...and it does work perfectly. It doesn't suffer from the problem that JavaFan pointed out in my original either. Perfect, now that's Regular Expression Elegance!
Re: Regular Expression Elegance by JadeNB (Chaplain) on Nov 11, 2008 at 15:32 UTC
“This seems to be hunky-dory, but I'm sure that there is a single-line regular expression solution, don't you?” I've been here long enough to know that the proper answer to "What regex allows me to manipulate this markup language in the way I want?" is "Don't use regexes to manipulate markup; use dedicated parsers", and yet I see that far wiser monks than I have given different answers. On reflection, that's probably because your `$value` is like no HTML on earth. Where is this coming from? (It's irrelevant to the solution, but I'm curious.)
Re^2: Regular Expression Elegance by monkie (Novice) on Nov 11, 2008 at 16:23 UTC
You're opening a can of worms here... legacy code is the short answer! User input is saved from a multi-line textbox to a db. To display the text, carriage returns are handled by this: `$value =~ s/\n/<br>/g;` Recently, code has been added to allow users to enter hyperlinks, using custom tags handled by another regular expression. If a typed link is longer than the multi-line textbox width, it will wrap and, thanks to the above regular expression, will contain a `<br>`, which breaks the link. Now you know...
Re^3: Regular Expression Elegance by JadeNB (Chaplain) on Nov 11, 2008 at 16:28 UTC
I guess that it's not possible to get at the `$value` before the `<br>`-substitution (maybe making a copy of it, so that you don't stomp on the display manipulations) and perform the substitution `$value =~ s/\n//g` on it instead?
Re^4: Regular Expression Elegance by monkie (Novice) on Nov 12, 2008 at 11:25 UTC
Re: Regular Expression Elegance by JavaFan (Canon) on Nov 11, 2008 at 13:26 UTC
“This seems to be hunky-dory” But it isn't. If you start with `$value` being `"<a href='foo'>bar</a>bla<br>bla<a href='bar'>foo</a>";` you end with `$value` being `"<a href='foo'>bar</a>blabla<a href='bar'>foo</a>";` You're bitten by the notion that `/PAT1'(.*?)'PAT2/` will not end up with a `'` in `$1`. It can: a lazy `.*?` still crosses a quote when that is the only way for the rest of the pattern to match. I'd go with: `while ($value =~ /href='\K([^']*<br>[^']*)(?=')/p) { my ($pre, $post) = (${^PREMATCH}, ${^POSTMATCH}); my $href = $1; $href =~ s/<br>//g; $value = "$pre$href$post"; }`
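The difference can be reproduced with a short script. This is only a sketch: the sub names `strip_lazy` and `strip_bounded` are invented here, the first uses the same patterns as the loop in the question, and the second wraps the loop suggested above.

```perl
use strict;
use warnings;
use 5.010;    # \K and the /p modifier need Perl 5.10 or later

# Hypothetical name: same patterns as the loop in the question.
# The lazy .*? is free to run past the quote that closes the href value.
sub strip_lazy {
    my ($value) = @_;
    while ($value =~ m/(href='.*?)<br>(.*?')/si) {
        $value =~ s/(href='.*?)<br>(.*?')/$1$2/gsi;
    }
    return $value;
}

# Hypothetical name: the loop suggested in this reply.
# [^'] keeps the whole match inside the quoted value.
sub strip_bounded {
    my ($value) = @_;
    while ($value =~ /href='\K([^']*<br>[^']*)(?=')/p) {
        my ($pre, $post) = (${^PREMATCH}, ${^POSTMATCH});
        my $href = $1;
        $href =~ s/<br>//g;
        $value = "$pre$href$post";
    }
    return $value;
}

my $input = "<a href='foo'>bar</a>bla<br>bla<a href='bar'>foo</a>";
say strip_lazy($input);       # the <br> between the two links is lost
say strip_bounded($input);    # the input comes back unchanged
```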
Re^2: Regular Expression Elegance by monkie (Novice) on Nov 11, 2008 at 14:25 UTC
I see. Well spotted! That was some quick thinking.
Re: Regular Expression Elegance by gone2015 (Deacon) on Nov 11, 2008 at 15:59 UTC
As others have pointed out, the key here is `[^']` to keep inside the `'....'`. This is a little shorter: `while ($value =~ s/href='[^']*?\K<br>//gsi) {} ;` but requires n+1 passes across `$value`, where n is the maximum number of `<br>` between `href='` and `'`. Or, following massa, above: `$value =~ s{href='[^']*?\K<br>([^']*)}{local $_ = $1, s/<br>//g, $_}gsie;` which does one pass only, and only processes any `href='...'` which contain one or more `<br>`. (A `(?=')` before the first `}` will stop it processing an unterminated `href='...'`, should you care.) And that fills my quota of regex finagling for the day.
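A small script makes the pass count concrete. This is a sketch with an invented sample string and variable names; the second substitution includes the optional `(?=')` guard mentioned above.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Invented sample: the first link has two breaks (so n = 2), the second has one.
my $input = q{<a href='a<br>b<br>c'>x<br>y</a> <a href='d<br>e'>z</a>};

# Variant 1: each pass deletes the first remaining <br> in every href value.
my $looped = $input;
my $passes = 1;    # accounts for the final pass, which finds nothing
$passes++ while $looped =~ s/href='[^']*?\K<br>//gsi;

# Variant 2: a single pass; the nested s/// cleans the rest of the value.
my $single = $input;
$single =~ s{href='[^']*?\K<br>([^']*)(?=')}{local $_ = $1, s/<br>//g, $_}gsie;

say $looped;    # <a href='abc'>x<br>y</a> <a href='de'>z</a>
say $single;    # same string as above
say $passes;    # 3, that is n+1 with n = 2
```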
Re^2: Regular Expression Elegance by monkie (Novice) on Nov 11, 2008 at 16:07 UTC
I love it!