Supplying the RHS to a regex as a variable

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

Re: Supplying the RHS to a regex as a variable by BUU (Prior) on Oct 18, 2003 at 05:59 UTC
`$s = ' <p>something something</p> <h2>blah blah blah</h2> <p>something something</p> '; $before = '<h2>([^<]+)</h2>'; $after = '"<h1>$1</h1>"'; $s =~ s/$before/$after/ee; print $s;` Output: `<p>something something</p> <h1>blah blah blah</h1> <p>something something</p>`
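For anyone wondering why two e's are needed: with a single /e the right-hand side `$after` is evaluated as an expression and simply yields its string value, quotes and all. The second /e then evals that string as Perl code, and that is the step that interpolates $1. Below is a minimal sketch of the idea wrapped in a sub, plus an alternative that passes a code reference so that no string from a variable gets eval'ed. The names replace_with_template and replace_with_callback are made up for illustration and appear nowhere in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): apply a pattern and a
# replacement template that are both held in variables. The template must
# itself be a quoted Perl string such as '"<h1>$1</h1>"'.
sub replace_with_template {
    my ($text, $pattern, $template) = @_;
    # First /e: evaluate the expression $template, yielding its string value.
    # Second /e: eval that string as Perl code, which interpolates $1.
    $text =~ s/$pattern/$template/ee;
    return $text;
}

# Alternative sketch (also hypothetical): take a code reference instead of
# a string. A single /e calls the sub with the captured text.
sub replace_with_callback {
    my ($text, $pattern, $callback) = @_;
    $text =~ s/$pattern/$callback->($1)/e;
    return $text;
}

my $s = '<p>something</p> <h2>blah blah blah</h2>';
print replace_with_template($s, '<h2>([^<]+)</h2>', '"<h1>$1</h1>"'), "\n";
print replace_with_callback($s, '<h2>([^<]+)</h2>', sub { "<h1>$_[0]</h1>" }), "\n";
```

Both calls should print the same line. Keep in mind that /ee runs whatever code is in the template, so it is only reasonable when the template comes from a source you trust.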
Re: Re: Supplying the RHS to a regex as a variable by Cody Pendant (Prior) on Oct 18, 2003 at 07:19 UTC
Picture me slapping my forehead. Thank you so much. I'd forgotten all about "ee" but now I'm remembering Merlyn's classic /old macdonald/eieio regex. `($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print`
Re: Re: Supplying the RHS to a regex as a variable by pg (Canon) on Oct 18, 2003 at 06:19 UTC
I can easily break this code by feeding it a valid HTML string: `$s = ' <p>something something</p> <h2>blah <<< blah blah</h2> <p>something something</p> ';` Update: BUU, although parsing is not the purpose, you have to realize that it is implicitly done anyway. Applying a regexp is obviously a kind of parsing, and many monks here have repeatedly said that a regexp is not a good way to deal with HTML. It is not easy to come up with a perfect regexp for HTML. On the other hand, although the purpose is not parsing, parsing is still a valid tool, isn't it? In this case, it is actually the better tool.
Re: Re: Re: Supplying the RHS to a regex as a variable by BUU (Prior) on Oct 18, 2003 at 07:21 UTC
But he's not parsing HTML. The parent node never even mentioned parsing HTML. All he wants to do is match a certain substring and replace it with another one. The minor fact that this string happens to contain data that superficially resembles HTML has absolutely no bearing on this.
Re: Supplying the RHS to a regex as a variable by pg (Canon) on Oct 18, 2003 at 07:03 UTC
A better way is to use an existing module to parse the HTML for you, for example HTML::Parser. (The code only demonstrates the idea. Obviously you need to add more handling, but the parsing is done for you correctly.) `use HTML::Parser; use strict; my $str = '<p>something something</p> <h2>blah <<< blah blah</h2> <p>something something</p>'; my $parser = HTML::Parser->new(default_h=> [\&handler, "tagname, text"]); $parser->parse($str); sub handler { my ($tag, $text) = @_; if ($tag) { if ($tag eq "h2") { print "<h1>"; } else { print "<$tag>"; } } else { print $text; } }`
Re: Re: Supplying the RHS to a regex as a variable by BUU (Prior) on Oct 18, 2003 at 07:19 UTC
I can easily break this code by feeding it: `my $str = '<p>something something</p> <h2>blah </a> blah blah</h2> <p>something something</p>';` Output: `<p>something something<p> <h1>blah <a> blah blah<h1> <p>something something<p>` (Where'd the slash before the a go? Heck if I know.)
Re: Re: Supplying the RHS to a regex as a variable by Cody Pendant (Prior) on Oct 18, 2003 at 07:43 UTC
Thanks for the code. It really was just an example though. I was thinking about the Big Picture rather than the code at hand.
Re: Supplying the RHS to a regex as a variable by BUU (Prior) on Oct 18, 2003 at 07:57 UTC
Speaking to your purpose and not your mechanics, after further experimentation I came up with this: `use strict; my $s = '<p>something something</p> <h2>blah </a> blah blah </h2> <p>something something</p>'; my $start = '<h2>'; my $end = '</h2>'; my $first = index($s,$start); my $last = index($s,$end,$first); substr($s, $first, $last-$first+length $end) = '<h1>'.substr($s,$first+length $start,$last-$first-length $start).'</h1>'; print $s;` This simply finds the first occurrence of $start, then finds the next occurrence of $end after that, and replaces that substring with a new substring. It's not quite as nice looking, but it doesn't break with random chars in between the start and end nodes. As it is now, it only replaces the first occurrence of the start/end tags, but if you wanted to do it generally, you would just need to keep track of wherever the last $start or $end tag (your preference) was found and use that as the starting point for the next index. At the moment the only bug I see is nesting, which would require slightly more complicated code, probably something along the lines of finding the first occurrence of $start, then finding the next occurrence of $end, and then checking whether another $start appears between them; if so, skip that $end tag and move on. That would of course break the current ability to repeat $start as many times as you want and just end with a single $end tag, and so would require 'properly nested data', which can probably be considered a feature =] This was mostly 'off the cuff' code and by no means thoroughly tested, so if anyone else sees any major flaws I would be interested in hearing them. Hrm, perhaps something like: `while(1) { if( index( substr( $s, $start_pos, $end_pos-$start_pos ), $start_tag ) != -1 ) { $end_pos = index( $s, $end_tag, ++$end_pos ); } else { last; # $end_pos is good } }` for the nested case, assuming the existence of certain vars to simplify things. This isn't tested at all however, and should be treated mostly as pseudocode.
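The "keep track of where the last tag was found" idea for replacing every occurrence could look roughly like the sketch below. It is only an illustration under stated assumptions: the sub name retag_all and its parameters are invented here, the tags are assumed to be non-empty literal strings, and nesting is deliberately not handled (each start tag is paired with the next end tag after it, just as in the single-occurrence code).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): swap every $start ... $end
# pair for $new_start ... $new_end, leaving the text in between untouched.
sub retag_all {
    my ($s, $start, $end, $new_start, $new_end) = @_;
    return $s if $start eq '' or $end eq '';   # assumption: tags are non-empty
    my $pos = 0;                               # where the next search begins
    while (1) {
        my $first = index($s, $start, $pos);
        last if $first == -1;                  # no more start tags
        my $last = index($s, $end, $first + length($start));
        last if $last == -1;                   # start tag without an end tag
        my $inner = substr($s, $first + length($start),
                           $last - $first - length($start));
        my $replacement = $new_start . $inner . $new_end;
        substr($s, $first, $last - $first + length($end)) = $replacement;
        # Resume after the text just written so it is never rescanned.
        $pos = $first + length($replacement);
    }
    return $s;
}

print retag_all('<h2>a </a> b</h2> <p>x</p> <h2>c</h2>',
                '<h2>', '</h2>', '<h1>', '</h1>'), "\n";
```

Restarting the search after the replacement, rather than after the old end tag, keeps the loop correct when the new tags differ in length from the old ones.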
Re: Supplying the RHS to a regex as a variable by delirium (Chaplain) on Oct 18, 2003 at 14:44 UTC
How about a simple: `s/<h2>/<h1>/gi; s#</h2>#</h1>#gi;` You could also use a stylesheet to change the properties of H2 tags and not bother replacing any existing markup.