in reply to Regex within html

While this might be achievable with regexes, I don't recommend it. Parsing HTML with regex is a fundamentally bad idea, because regexes aren't good for matching nested data structures.

What I'd recommend instead is to tokenize your text, that is, split it into chunks that are each one of: 1) normal text, 2) one of your special comments, or 3) an opening or closing <a> tag.

Then iterate over these chunks, keeping a running count of opening minus closing anchor tags. While iterating over the tokens you build up an output string, and since the count tells you whether you are currently inside an <a> element, it shouldn't be too hard to get the nesting of <a> tags right.
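To make the counting idea concrete, here is a rough sketch in Python (the same approach carries over to Perl). It assumes the "special comments" look like ordinary HTML comments, which is only a guess at the original markup, and the function name comments_with_depth is made up for illustration. It only reports how deeply each comment sits inside <a> tags; building the output string would hang off the same loop.

```python
import re

# Assumed token shapes: an HTML comment stands in for the "special comment".
# The capturing group makes re.split keep the matched tokens in its result.
SPLIT_RE = re.compile(
    r'(<!--(?:(?!-->).)*-->|</?a\b[^>]*>)',
    re.IGNORECASE | re.DOTALL,
)

def comments_with_depth(html):
    """Return (comment, depth) pairs, where depth is the number of
    <a> tags that are open at the point where the comment appears."""
    depth = 0
    found = []
    # With one capturing group, re.split alternates: text, token, text, ...
    for i, chunk in enumerate(SPLIT_RE.split(html)):
        if i % 2 == 0:
            continue                    # plain text between tokens
        if chunk.startswith('<!--'):
            found.append((chunk, depth))
        elif chunk.startswith('</'):
            depth = max(depth - 1, 0)   # tolerate stray closing tags
        else:
            depth += 1                  # opening <a ...> tag
    return found

sample = 'x <!-- one --> <a href="u">y <!-- two --></a> <!-- three -->'
for comment, depth in comments_with_depth(sample):
    print(depth, comment)
```

A depth of 0 means the comment is outside any anchor, so it would be safe to wrap it in a new <a> tag there; a depth above 0 means a new anchor would nest inside an existing one.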

Re^2: Regex within html
by ropey (Hermit) on Sep 08, 2008 at 12:38 UTC

    Thanks for the assist, Moritz.

    I'm not sure how you would 'tokenize' it in the first place. Would that not need a regex as well?

    It's also worth mentioning that this isn't about a templating system. The raw HTML is generated by another host (which I have no control over), and that is all I have to work with.

      I'm not sure how you would 'tokenize' it in the first place. Would that not need a regex as well?

      It sure would, but the point is that it would need one small regex per possible token type, not one huge regex that solves the whole problem.
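      As a rough illustration of "one regex per token type", here is a small table-driven tokenizer in Python. The token names and patterns are assumptions chosen for the example (in particular, plain HTML comments stand in for the special comments), and this is not how any particular lexer module works internally.

```python
import re

# One small pattern per token type, tried in order at the current position.
# Names and patterns are illustrative assumptions, not a fixed API.
TOKEN_TYPES = [
    ('COMMENT', re.compile(r'<!--(?:(?!-->).)*-->', re.DOTALL)),
    ('A_OPEN',  re.compile(r'<a\b[^>]*>', re.IGNORECASE)),
    ('A_CLOSE', re.compile(r'</a\s*>', re.IGNORECASE)),
    ('TEXT',    re.compile(r'[^<]+')),
    ('TEXT',    re.compile(r'<')),   # a '<' that starts none of the above
]

def tokenize(html):
    """Split html into (type, text) pairs without losing any characters."""
    pos = 0
    tokens = []
    while pos < len(html):
        for name, pattern in TOKEN_TYPES:
            m = pattern.match(html, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        # The two TEXT patterns together match any single character,
        # so one of the patterns always succeeds and pos always advances.
    return tokens

for name, text in tokenize('a<a href="x">b</a><!-- c -->'):
    print(name, repr(text))
```

      Each pattern stays short enough to check by eye, and tags other than <a> simply fall through as TEXT, which is all that is needed when only anchors and comments matter.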

      Usually I use the tokenizer from Math::Expression::Evaluator::Lexer (don't let the name fool you; it's good for more than mathematical expressions), from which you could draw inspiration.

      And don't use .* in your regexes; that's almost always an error, because it greedily matches far more than you intended. See Death to Dot Star!.
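      A small demonstration of the problem, sketched in Python with a made-up input string: the greedy .* runs to the last possible match, while a negated character class stops at the first closing quote.

```python
import re

html = '<a href="one">first</a> and <a href="two">second</a>'

# Greedy dot-star: runs on to the LAST '">' in the string.
greedy = re.findall(r'<a href="(.*)">', html)

# Negated character class: cannot run past the closing quote.
specific = re.findall(r'<a href="([^"]*)">', html)

print(greedy)    # one long, wrong capture spanning both links
print(specific)  # ['one', 'two']
```

      Saying exactly what may appear between the delimiters is usually both more correct and faster than letting the regex engine backtrack out of an overly generous .* match.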