in reply to Re: Recursive Regex
in thread Parsing using Regex and Lookahead


Nice remarks! I will have to try this when I get home.


As for the stream, you are right. I also wouldn't call myself a newbie, and "novice" was probably the wrong word too. If I had to name my level, I'd say a lapsed amateur: never an expert, let alone a monk, but a little better than a neophyte.

I've done much of this stuff before, but have since forgotten it. That's why I opened the thread looking for help, and then remembered that "lookahead" was what I was after.

The issue here is that I'm going to have to manipulate this in the future. For simplicity I made the delimiters ([]) and (\n), and the text between them may or may not be nested. I used (.*) on purpose, because I might want an empty label to produce a blank div for formatting.

I know I could just write everything in HTML-like syntax, or possibly some made-up pseudo-SGML like wiki markup, but that would both take the fun away and increase my typing time. Something just interests me about having a single kind of label.

One of my concerns is efficiency. I think streaming is the best choice, no question, when the strings are lengthy.

I don't recall what backtracking is, so I'm going to have to evaluate what moritz was talking about with /^\[ ([^\]]+) \]/x. I vaguely remember an issue like the one he described with ab, but for some reason I thought the non-greedy (?) would take care of it.
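To make the difference concrete, here is a minimal sketch in Python's re module, whose syntax for these constructs matches Perl's. The sample text and variable names are made up for illustration. It compares the negated class from moritz's pattern with a non-greedy (.+?): both give the same capture on a simple match, but once more pattern follows the closing bracket, the non-greedy version can backtrack its way past a ']'.

```python
import re

text = "[note] body [more] tail"

# Negated class: the capture can never contain a ']'.
negated = re.compile(r"^\[ ([^\]]+) \]", re.VERBOSE)
# Non-greedy: prefers the shortest capture, but '.' may still match ']'.
lazy = re.compile(r"^\[ (.+?) \]", re.VERBOSE)

simple_negated = negated.match(text).group(1)
simple_lazy = lazy.match(text).group(1)
print(simple_negated)  # note
print(simple_lazy)     # note

# Now require " tail" right after the closing bracket.
negated_tail = re.compile(r"^\[ ([^\]]+) \] \s tail", re.VERBOSE)
lazy_tail = re.compile(r"^\[ (.+?) \] \s tail", re.VERBOSE)

strict_result = negated_tail.match(text)        # no match at all
loose_capture = lazy_tail.match(text).group(1)  # grows past the first ']'
print(strict_result)   # None
print(loose_capture)   # note] body [more
```

So non-greedy only changes which match is tried first; it does not stop the engine from extending the capture when the rest of the pattern fails.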

Re: Recursive Regex: Response
by deMize (Monk) on Mar 11, 2009 at 20:22 UTC
    Okay, so after looking at it, ([^\]]+) basically says "match one or more characters that are not a close-bracket". I would use (*) instead of (+) because of the empty div discussed above.
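A small sketch of the (+) versus (*) difference, again in Python's re (same quantifier semantics as Perl); the sample string is hypothetical:

```python
import re

one_or_more = re.compile(r"^\[([^\]]+)\]")   # needs at least one character
zero_or_more = re.compile(r"^\[([^\]]*)\]")  # also accepts an empty label

empty_label = "[] spacer line"

plus_result = one_or_more.match(empty_label)
star_result = zero_or_more.match(empty_label)

print(plus_result)                 # None
print(repr(star_result.group(1)))  # ''
```

With (*), the empty label still matches and the capture is simply the empty string, which is what a blank div would need.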

    I wonder whether (?>) might be of use.


    Additionally, I don't know how I would stream the text. It's a parameter passed from a webform.
      Sorry not to reply sooner, but I've been busy.

      Backtracking is simply what the regex engine does when the path it is on fails to match but it still has other alternatives to consider. A simple example: if you want to match /(this|that).*(these|those)/, the engine first looks for a 't', then an 'h', and so on. If it finds an 'n' after 'thi', it backtracks to see whether it can match 'that' instead. In this case, though it might not be nice to look at, breaking it out into four regexes (/this.*these/, /this.*those/, /that.*these/, /that.*those/) turns out to be more efficient than the alternation version, because if one of them fails to find 'this', for example, it simply fails without trying additional alternatives.
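As a rough sketch (Python's re, with made-up sample strings and helper names), the alternation and the four separate patterns accept exactly the same inputs. This only checks that the results agree; it does not measure speed, which will depend on the engine and the data.

```python
import re

combined = re.compile(r"(this|that).*(these|those)")

# The same alternation written out as four plain patterns.
separate = [
    re.compile(f"{first}.*{second}")
    for first in ("this", "that")
    for second in ("these", "those")
]

def matches_combined(s):
    return combined.search(s) is not None

def matches_separate(s):
    return any(p.search(s) for p in separate)

samples = [
    "thin ice, then those",       # 'thi' followed by 'n': no 'this' or 'that'
    "that one and these",         # that ... these
    "this and that",              # nothing from the second group
    "is this the one, or those?", # this ... those
]

combined_results = [matches_combined(s) for s in samples]
separate_results = [matches_separate(s) for s in samples]
print(combined_results)  # [False, True, False, True]
print(separate_results)  # [False, True, False, True]
```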

      Anyway, (?>...) is a way to cut off backtracking for hairy regexes: once the group has matched, the engine will not go back into it to try a shorter match. It can make parsing a lot faster.
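Here is a small sketch of that effect. Python's re only accepts (?>...) directly from version 3.11 on, so to keep this runnable on older versions it uses a well-known emulation instead: a lookahead never gives back what it matched, so (?=(a+))\1 behaves like (?>a+). The pattern and input are toy examples in the spirit of the 'ab' case mentioned earlier in the thread.

```python
import re

text = "aaab"

# Ordinary greedy group: a+ first takes "aaa", then gives one 'a' back
# so that the trailing "ab" can match.
backtracking = re.compile(r"a+ab")

# Emulated atomic group, equivalent in effect to (?>a+)ab:
# the lookahead captures the full run of a's and \1 consumes it,
# with no way to hand any of it back.
atomic_like = re.compile(r"(?=(a+))\1ab")

normal_result = backtracking.search(text)
atomic_result = atomic_like.search(text)

print(normal_result.group(0))  # aaab
print(atomic_result)           # None
```

The atomic version fails faster because it never retries shorter runs of 'a', but as the output shows it can also reject input the ordinary pattern would accept, so it only helps where giving characters back could never lead to a match anyway (as with [^\]]* followed by a ']').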