UNO DOS, HTML

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: UNO DOS by lhoward (Vicar) on Sep 13, 2000 at 04:13 UTC
You might be better off using HTML::Parser to parse out the HTML tags and then applying your regular expression on a tag-by-tag basis. Given the variety and complexity that can occur in an HTML document, it will be very difficult to get a single regular expression to work properly.
RE: Re: UNO DOS by runrig (Abbot) on Sep 13, 2000 at 04:23 UTC
I would agree, unless this is the ONLY thing he wants to do.
Re: UNO DOS by jreades (Friar) on Sep 13, 2000 at 03:29 UTC
Your problem is the `.*?`. Although you rightly tried to limit the number of don't-care characters using '?', your match still starts at the first '<!--' and grabs the smallest number of don't-care characters from there to 'DOS'... which just happens to include another '-->' and '<!--', the pieces it isn't supposed to grab. You'll need a few baselines to come up with a workable regexp: Can xxxx ever include '<!--' or '-->'? (We'd better hope not.) Can xxxx contain only word-like characters (\w), or can it include space characters as well (\s)? These would help you optimize your regexp... But the key point is that you need to limit your regexp to a single comment group containing 'DOS'. I'd suggest using: `$html = '<!--% xxxx UNO xxxx %--> <!--% xxxx DOS xxxx %-->'; $html =~ s/<!--%([^->]+?) DOS ([^->]+)%-->/GONE/s; print $html;` It's a little ugly, and notice that it assumes that your 'xxxx' can't contain '->', which may or may not be the case. YMMV
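For anyone following along, here is a minimal sketch of the behavior described above. The original question's substitution is not shown in this thread, so the "lazy" pattern below is an assumed reconstruction of it, not the asker's actual code; the variable names `$lazy` and `$class` are made up for illustration.

```perl
use strict;
use warnings;

my $html = '<!--% xxxx UNO xxxx %--> <!--% xxxx DOS xxxx %-->';

# ASSUMPTION: roughly what the asker tried. The non-greedy .*? still
# begins at the FIRST <!--% and stretches until it reaches DOS, so the
# UNO block is swallowed along with the DOS block.
(my $lazy = $html) =~ s/<!--%.*?DOS.*?%-->/GONE/s;

# The character-class version from the post above: the class cannot
# cross '-' or '>', so the match cannot run past the end of a block.
(my $class = $html) =~ s/<!--%([^->]+?) DOS ([^->]+)%-->/GONE/s;

print "$lazy\n";   # GONE
print "$class\n";  # <!--% xxxx UNO xxxx %--> GONE
```

The non-greedy quantifier only minimizes how far the match extends to the right; it has no say in where the match starts, which is always the leftmost position that can succeed.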
RE: Re: UNO DOS by Adam (Vicar) on Sep 13, 2000 at 04:26 UTC
Good call on the .* There is a node around here called Death to Dot Star! which explores this further. But your regex still needs work. The bracket elements are not a group; they are individual characters, meaning that it would catch any HTML item, not just -->, because it matches the > alone. Ok? How about: `$html =~ s/<!--%(?!%-->)DOS(?!%-->)%-->/GONE/s;` I'm not sure about that regex, as I've never used a zero-width negative look-ahead assertion, but I think that's the right direction.
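A hedged sketch of where the look-ahead idea can lead: a look-ahead on its own consumes nothing, so it has to be paired with a `.` and repeated to step through the block one character at a time. The helper name `replace_dos_blocks` and the sample input are made up for illustration, and this assumes `<!--%` and `%-->` never nest.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every <!--% ... %--> block that contains DOS.
sub replace_dos_blocks {
    my ($html) = @_;
    # (?:(?!%-->).)*? consumes one character at a time, but only while
    # that character does not begin the closing marker %-->. The match
    # therefore cannot leak out of the block it started in.
    $html =~ s/<!--%(?:(?!%-->).)*?DOS(?:(?!%-->).)*?%-->/GONE/gs;
    return $html;
}

# The DOS block contains ordinary HTML and a plain <!-- --> comment.
my $in  = '<!--% a UNO b %--> <!--% <b>x</b> <!-- note --> DOS y %--> <!--% c %-->';
my $out = replace_dos_blocks($in);
print "$out\n";  # <!--% a UNO b %--> GONE <!--% c %-->
```

Unlike the character-class version, this tolerates `-` and `>` inside the block, since only the full `%-->` sequence stops the scan.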
RE: RE: Re: UNO DOS by jreades (Friar) on Sep 13, 2000 at 17:45 UTC
I've been sweating buckets about this one ever since I left the office... <visions of minus 30XP dancing in my head > which is, of course, just the time to realize that you screwed up the regexp. :^P
Re: UNO DOS by Anonymous Monk on Sep 13, 2000 at 03:41 UTC
Yeah, that's part of the problem... the xxxx can contain HTML code, which may contain regular comments (<!-- -->). However, xxxx will never contain <!--% and %-->; those are only used as braces. And to mirod: another thing is that there could be any number of these tags before/after the tag we're intending to grab.
RE: UNO DOS, HTML by runrig (Abbot) on Sep 13, 2000 at 04:02 UTC
My attempt: `$html =~ s/(<!--%(.*?)%-->)/($2=~m|DOS|)? 'GONE' : $1/esg;`
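A minimal sketch of the same match-each-block-then-decide idea, wrapped in a function so it can be tried against several blocks at once. The name `replace_blocks_with` and the sample input are hypothetical; `index` is swapped in for the inner regex purely to keep the capture variables untouched, and that is a choice made here, not part of the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: visit each <!--% ... %--> block in turn and
# decide per block whether to replace it or put it back unchanged.
sub replace_blocks_with {
    my ($html, $word) = @_;
    # /e evaluates the replacement as Perl code, /g repeats for every
    # block, /s lets . cross newlines. $1 is the whole block, $2 its body.
    $html =~ s{(<!--%(.*?)%-->)}{ index($2, $word) >= 0 ? 'GONE' : $1 }esg;
    return $html;
}

my $in  = '<!--% one %--> <!--% xxxx DOS xxxx %--> <!--% UNO %--> <!--% DOS again %-->';
my $out = replace_blocks_with($in, 'DOS');
print "$out\n";  # <!--% one %--> GONE <!--% UNO %--> GONE
```

Because the outer pattern matches exactly one block at a time, the test for DOS only ever sees a single block's contents, which handles any number of blocks before or after the target.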
RE: UNO DOS, HTML by mirod (Canon) on Sep 13, 2000 at 03:14 UTC
Just grab the first comment: `$html = '<!--% xxxx UNO xxxx %--> <!--% xxxx DOS xxxx %-->'; $html =~ s/(<!--.*?-->\s)<!--%(.*?) DOS (.*?)%-->/$1GONE/s; print $html;` Or make sure there is a comment beforehand: `$html = '<!--% xxxx UNO xxxx %--> <!--% xxxx DOS xxxx %-->'; $html =~ s/-->\s<!--%(.*?) DOS (.*?)%-->/--> GONE/s; print $html;` There is probably a cleaner way to do this without capturing the first comment at all, or by using a `g` modifier, skipping the first comment and replacing the second.
Re: UNO DOS by Anonymous Monk on Sep 13, 2000 at 04:35 UTC
Well, sounds like I have to parse out each tag individually... yucky. I guess I'll try to figure out some slob fix right now and rework the design of the templates.
RE: Re: UNO DOS by lhoward (Vicar) on Sep 13, 2000 at 04:46 UTC
Instead of reworking "the design of the templates", why not use one of the many text/HTML templating modules already in place? HTML::Template, Text::Template, and many more...
RE: Re: UNO DOS by runrig (Abbot) on Sep 13, 2000 at 04:38 UTC
My answer DOES parse each tag individually, it uses a regex inside a regex, and seems to work.
RE: RE: Re: UNO DOS by Anonymous Monk on Sep 13, 2000 at 04:40 UTC
yeah I got it.. I was trying to avoid that, but I guess I can't.. (?) thanks tho