Re: Regex within html

While this might be achievable with regexes, I don't recommend it. Parsing HTML with regex is a fundamentally bad idea, because regexes aren't good for matching nested data structures.

What I'd recommend instead is to tokenize your text, that is, split it into chunks that are either 1) normal text, 2) your special comments, or 3) opening or closing <a> tags.

Then iterate over these chunks, keeping a running count of opening minus closing anchor tags, which tells you whether you are currently inside an <a> element. While iterating over the tokens you construct an output string, and with that count at hand it shouldn't be too hard to get the nesting of <a> tags correct.
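
A minimal sketch of that approach, written in Python for brevity (the same structure carries over to Perl). The original markup isn't shown in this thread, so the marker comments `<!--link-->` and `<!--/link-->`, the `href` default, and the function names are all hypothetical; treat this as an illustration of the tokenize-then-count idea, not a finished solution.

```python
import re

# One small pattern per token type, tried in order.
# The comment syntax handled here is an assumption for illustration.
TOKEN_RE = re.compile(
    r"(?P<comment><!--[^>]*-->)"   # special comments (assumed to contain no '>')
    r"|(?P<open_a><a\b[^>]*>)"     # opening anchor tag
    r"|(?P<close_a></a\s*>)"       # closing anchor tag
    r"|(?P<text>[^<]+|<)",         # everything else, including other tags
    re.IGNORECASE,
)


def tokenize(html):
    """Split html into (kind, text) pairs that cover the whole input."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(html)]


def rewrite(html, href="#"):
    """Turn hypothetical <!--link--> ... <!--/link--> markers into anchors,
    but only where we are not already inside an <a> element."""
    out = []
    depth = 0        # opening minus closing <a> tags seen so far
    opened = False   # True while an anchor we emitted ourselves is open
    for kind, text in tokenize(html):
        if kind == "open_a":
            depth += 1
            out.append(text)
        elif kind == "close_a":
            depth = max(depth - 1, 0)
            out.append(text)
        elif kind == "comment" and text == "<!--link-->":
            if depth == 0:               # avoid nesting <a> inside <a>
                out.append(f'<a href="{href}">')
                opened = True
        elif kind == "comment" and text == "<!--/link-->":
            if opened:
                out.append("</a>")
                opened = False
        else:
            out.append(text)             # plain text and unrelated comments
    return "".join(out)
```

For example, `rewrite('x <!--link-->go<!--/link--> y')` wraps `go` in an anchor, while the same markers inside an existing `<a>` element are simply dropped. The sketch does not handle an `<a>` tag that appears between the two markers; that case would need an extra check on `opened`.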

Replies are listed 'Best First'.
Re^2: Regex within html by ropey (Hermit) on Sep 08, 2008 at 12:38 UTC
Thanks for the assist, Moritz. I'm not sure how you would 'tokenize' it in the first place; would that not need a regex as well? It's also worth mentioning that this isn't about a templating system: the raw HTML is generated by another host (which I have no control over), and that is all I have to work with.
Re^3: Regex within html by moritz (Cardinal) on Sep 08, 2008 at 13:08 UTC
"I'm not sure how you would 'tokenize' it in the first place; would that not need a regex as well?"

It sure would, but the point is that it needs one small regex per possible token type, not one huge regex that solves the whole problem. Usually I use the tokenizer from Math::Expression::Evaluator::Lexer (don't let the name fool you; it's good for more than mathematical expressions), from which you could draw inspiration. And don't use `.*` in your regexes; that's almost always an error. See Death to Dot Star!
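
To illustrate the `.*` warning, here is a small sketch in Python (Perl's regex engine is greedy in the same way). The sample HTML string and variable names are made up for the example.

```python
import re

html = '<a href="one">first</a> and <a href="two">second</a>'

# Greedy `.*` runs to the end of the line and backtracks only as far as
# the LAST `">`, so it swallows everything between the two tags.
greedy = re.findall(r'<a href="(.*)">', html)

# A negated character class says exactly what may appear inside the
# quotes, so each match stops at the first closing quote.
precise = re.findall(r'<a href="([^"]*)">', html)

print(greedy)   # ['one">first</a> and <a href="two']
print(precise)  # ['one', 'two']
```

Spelling out what a token may contain, as `[^"]*` does here, is also what keeps each per-token regex in a tokenizer small and predictable.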
Re^3: Regex within html by Anonymous Monk on Sep 08, 2008 at 12:43 UTC
Get yourself an HTML parser, and use it :)