Parsing using Regex and Lookahead

deMize has asked for the wisdom of the Perl Monks concerning the following question:

Re: Recursive Regex by moritz (Cardinal) on Mar 11, 2009 at 07:52 UTC
Just a small hint: when you have something of the form `^\[ .*? \]`, the regex engine can backtrack over the `]` char, which might not be what you want. For example, it can match the whole of the string `[a][b]` if something after that part of the regex causes backtracking. Instead you can use this regex: `^\[ [^\]]* \]`, which always matches an opening bracket, other characters and then the closing bracket, so in the example above it will never match more than the `[a]` (whitespace added for clarity; you need the /x modifier if you want to keep it).
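A minimal sketch of the backtracking behaviour described above. The sample string and the trailing `!` (standing in for "something after that part of the regex") are assumptions chosen for illustration, not part of the original question.

```perl
use strict;
use warnings;

my $text = '[a][b]!';

# Non-greedy dot: the trailing '!' forces the engine to keep extending .*?
# until the whole pattern succeeds, so the capture runs across the first ']'.
my ($lazy) = $text =~ /^\[ (.*?) \] !/x;

# Negated character class: it can never consume a ']', so the match cannot
# grow beyond the first bracket pair and the pattern fails on this input.
my ($strict) = $text =~ /^\[ ([^\]]*) \] !/x;

print 'lazy:   ', (defined $lazy   ? $lazy   : '(no match)'), "\n";
print 'strict: ', (defined $strict ? $strict : '(no match)'), "\n";
```

On this input the lazy version captures `a][b`, while the character-class version reports no match, which is usually the safer failure mode when each bracket pair is meant to be a separate tag.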
Re: Recursive Regex by johngg (Canon) on Mar 11, 2009 at 09:46 UTC
While you are on the subject of lookaheads, this

    $code =~ s/\n//g;   # remove all the newlines
    $code .= "\n";      # add one to the end

could be replaced by this

    $code =~ s/\n(?=.)//g;   # remove all but the last newline

I hope this is of interest. Cheers, JohnGG
Re^2: Recursive Regex by JavaFan (Canon) on Mar 11, 2009 at 11:09 UTC
Your replacement isn't equivalent to the code you claim it can replace. First of all, your code will not remove a newline followed by a newline; you'd need `(?=(?s:.))` for that, or the `/s` modifier. Second, the original code will always leave `$code` ending with a newline, regardless of whether it ended with one to begin with. In your replacement, there will only be a trailing newline in `$code` if there was one originally.
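The two differences pointed out above can be checked with a small sketch. The helper names (`strip_then_append`, `lookahead_plain`, `lookahead_dotall`, `show`) and the sample strings are made up for this illustration.

```perl
use strict;
use warnings;

# Original two-step version: strip every newline, then append exactly one.
sub strip_then_append {
    my ($s) = @_;
    $s =~ s/\n//g;
    return $s . "\n";
}

# Lookahead without /s: '.' cannot match "\n", so a newline that is
# followed by another newline is left in place.
sub lookahead_plain {
    my ($s) = @_;
    $s =~ s/\n(?=.)//g;
    return $s;
}

# Lookahead with /s: '.' matches any character, so only a newline at the
# very end of the string is kept.
sub lookahead_dotall {
    my ($s) = @_;
    $s =~ s/\n(?=.)//gs;
    return $s;
}

# Make newlines visible when printing results.
sub show {
    my ($s) = @_;
    $s =~ s/\n/\\n/g;
    return qq("$s");
}

for my $input ("a\n\nb\n", "a\nb") {
    print 'input ', show($input), "\n";
    print '  strip_then_append: ', show(strip_then_append($input)), "\n";
    print '  lookahead_plain:   ', show(lookahead_plain($input)),   "\n";
    print '  lookahead_dotall:  ', show(lookahead_dotall($input)),  "\n";
}
```

For `"a\n\nb\n"` the plain lookahead leaves `"a\nb\n"`, while the other two give `"ab\n"`. For `"a\nb"` only the two-step version ends with a newline.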
Re^3: Recursive Regex by johngg (Canon) on Mar 11, 2009 at 11:44 UTC
Good catch re. the `s` modifier, I missed that. Thanks for the correction. With regard to your second point, from the way the OP initialised `$code` I don't think always ending with a newline was the requirement s/he was addressing. For the more general case you are correct. Cheers, JohnGG
Re^3: Recursive Regex by deMize (Monk) on Mar 11, 2009 at 17:21 UTC
So I'm thankful for the help! First off: deMize « he ». Second, I'm not sure what I did was the best way of going about things. What I'm doing is just building a CMS insertion page with as little markup as possible, where each section is contained in its own div.

Response: The first problem was getting each section into its own div, which the lookahead helped with. To accomplish this, it just ends the current div when it hits a new section (which may be a subsection in the future). The reason for removing the line breaks was that I was using a line break as the delimiter for the last section. There is probably a better way of doing that by detecting the end of the string in the regex (maybe `$`), but I need to replace those line breaks with HTML breaks anyhow.

The problem: This means that it is going to be linear with no sub-divs. I might have to rethink that, because I might want to have something like this later:

    [section]
    [top]Top Data
    [middle]Mid Data
    [bottom]Bottom Data
    [section]
    [top]Top Data
    [bottom]Bottom Data

Should result in:

    <div class="section">
      <div class="top">Top Data</div>
      <div class="middle">Mid Data</div>
      <div class="bottom">Bottom Data</div>
    </div>
    <div class="section">
      <div class="top">Top Data</div>
      <div class="bottom">Bottom Data</div>
    </div>

What I plan to do is store either the subsections or the sections in an array. I could probably use help with a better algorithm. The whole purpose of this was so that I could quickly type the data into one input box, without building a whole intricate interface (that can come later).
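One possible way to get the nested layout asked about above, offered as a sketch rather than a recommendation. It assumes that a tag literally named `section` opens a container and that every other tag is a leaf inside the current container. The function name `nest_sections` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turns "[section] [top]Top Data ..." into nested divs.
# Assumption: "section" opens a container; any other tag is a leaf div.
sub nest_sections {
    my ($text) = @_;
    my $html = '';
    my $open = 0;    # true while a section div is open

    # Each iteration captures one [tag] and the text up to the next '['.
    while ($text =~ /\[([^\]]+)\]([^\[]*)/g) {
        my ($tag, $body) = ($1, $2);
        $body =~ s/^\s+|\s+$//g;    # trim surrounding whitespace

        if ($tag eq 'section') {
            $html .= "</div>\n" if $open;          # close the previous section
            $html .= qq(<div class="section">\n);  # open a new one
            $open = 1;
        }
        else {
            $html .= qq(  <div class="$tag">$body</div>\n);
        }
    }
    $html .= "</div>\n" if $open;    # close the last section
    return $html;
}

my $input = '[section] [top]Top Data [middle]Mid Data [bottom]Bottom Data '
          . '[section] [top]Top Data [bottom]Bottom Data';
print nest_sections($input);
```

This only handles one level of nesting. Deeper nesting would need either explicit closing tags or a stack of open containers, which goes beyond what the input format shown here can express.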
Re^4: Recursive Regex by furry_marmot (Pilgrim) on Mar 19, 2009 at 20:36 UTC
Re: Recursive Regex by furry_marmot (Pilgrim) on Mar 11, 2009 at 18:12 UTC
deMize, Try this code:

    my $code = <<EOT;
    [head]
    Head text...
    [body]
    Body text...
    [something else]
    more text..
    EOT

    $code =~ s/[\n ]+?(?=\S)//sg;   # remove all newlines but last
    formatCode($code);

    sub formatCode {
        my $str = shift;
        $str =~ s{ \[ ([^\]]+) \] ([^\]]*) (?=\[|\n) }
                 {<div class="$1">$2</div>}igx;
        print $str;
    }

I took your code, mixed in some of the suggestions of others, and made some changes I hope you find useful (since you asked). First, I changed the quoting to a "here doc", which is more like the file you'll probably be reading from in most situations. Then I made a slight change to the way suggested elsewhere in the thread of removing the newlines, using a zero-width positive lookahead assertion (look up "?=" in perldoc perlre). The changes I made also accommodate blank lines in the source, which will help with readability. Finally, in the regex, I used the /x modifier and curly braces to aid readability, and used a character class, so you're now looking for "bracket, (not a bracket)+, bracket, (not a bracket)*, stop before a bracket". Using "+" in the first capture and "*" in the second capture enforces an assumption (that you should think about and modify to suit) that brackets ALWAYS have a div name in them, but there might be no text after it. If there must always be text after a div name, use the "+" in both cases.

Since you are a professed newbie, I'll tell you that this is a common technique when you're learning. Later, you'll find it more practical (especially with large files) to process a stream, rather than manipulate a huge string. So, I'll leave you with this:

    my $code = <<EOT;
    [head]
    Head text...
    [body]
    Body text...
    [something else]
    more text..
    EOT

    my $prev_header = 0;                     # takes care of closing divs
    for (split /\n/, $code) {                # No need to remove newlines
        if (/^\s*$/) { next }                # Skip blank lines
        elsif ( /^\[ ([^\]]+) \]/x ) {       # If line is a header...
            print "</div>" if $prev_header;  # close prev div, if there was one...
            print qq(<div class="$1">);      # print current div...
            $prev_header = 1;                # and set current div flag.
        }
        else { print "$1 " }                 # Simply print non-header lines
    }
    print "</div>";   # You always have to close last div, so just do it here

To read from a file:

    open my $FH, '<', 'whatever.txt' or die "Cannot open whatever.txt: $!";
    while (<$FH>) { chomp; ... }
Recursive Regex: Response by deMize (Monk) on Mar 11, 2009 at 19:17 UTC
Nice remarks! I will have to try this when I get home. As for the stream, you are right. Also, I wouldn't say I'm a newbie, and I guess novice was incorrect to say as well; if I had to call my level something, I guess I should have said a former amateur, never reaching expert or monk status, but a little better than a neophyte. I've done much of this stuff before, but have since forgotten it. Hence my opening the thread looking for help, but then remembering that the "lookahead" was what I was after.

The issue here is that I'm going to have to manipulate this in the future. For simplicity I made the delimiters ([]) and (\n), which may or may not have nested text. So, I used (.*) on purpose, because I might want to include something in a blank div for formatting. I know I could just write everything in HTML-like syntax, or possibly some made-up pseudo-SGML, like Wiki markup, but that would both take the fun away and increase my typing time. Something just interests me about having one label.

One of my concerns is that I wanted it to be efficient. I think streaming is the best choice, no question, with lengthy strings. I don't recall what backtracking is, so I'm going to have to evaluate what moritz was talking about with: `/^\[ ([^\]]+) \]/x`. I vaguely remember an issue like the one he described with `[a][b]`, but for some reason I thought the non-greedy (?) would take that away.
Re: Recursive Regex: Response by deMize (Monk) on Mar 11, 2009 at 20:22 UTC
Okay, so after looking at it, it looks like (`[^\]]+`) is basically saying: match anything that is not a close-bracket. I would use a (*) instead of a (+) for the reason of the empty div discussed above. I wonder whether (?>) might be of use. Additionally, I don't know how I would stream the text. It's a parameter passed from a webform.
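On the streaming question: Perl can open a filehandle on a scalar, so text that arrives as a string can still be read line by line. The sketch below assumes the form value is already in a variable; `$form_text` and its contents are stand-ins for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the text received from the web form.
my $form_text = "[head]\nHead text...\n\n[body]\nBody text...\n";

# Opening a filehandle on a scalar reference lets the string be read
# one line at a time, just like a file.
open my $fh, '<', \$form_text or die "Cannot open in-memory handle: $!";

my @headers;
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^\s*$/;                           # skip blank lines
    push @headers, $1 if $line =~ /^\[ ([^\]]*) \]/x;   # collect tag names
}
close $fh;

print join(', ', @headers), "\n";
```

For a form field of modest size this gains little over `split /\n/`, since the whole string is in memory either way. The benefit is that the same loop body then works unchanged on a real file handle.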
Re^2: Recursive Regex: Response by furry_marmot (Pilgrim) on Mar 19, 2009 at 20:26 UTC
Re^2: Recursive Regex by deMize (Monk) on Mar 12, 2009 at 03:15 UTC
Just so everyone knows, some changes needed to be made. The updated routine:

    sub formatCode {
        my $code = shift;
        my $prev_header = 0;                          # takes care of closing divs
        for (split /\n/, $code) {                     # No need to remove newlines
            if (/^\s*$/) { next }                     # Skip blank lines
            elsif ( /^(.*?)\[ ([^\]]*) \](.*)/x ) {   # If line is a header...
                print "$1</div>" if $prev_header;     # close prev div, if there was one...
                print qq(<div class="$2">$3);         # print current div...
                $prev_header = 1;                     # and set current div flag.
            }
            else { print "$_<br />" }                 # Simply print non-header lines
        }
        print "</div>" if $prev_header;   # You always have to close last div, so just do it here
    }

Basically $1 needed to be changed to $_. Also, I added matches for before and after the tag, in case there was inline text. Finally, the line break was added in place of a newline character. I'm still curious whether $prev_header should be reset to 0 after the close div has been printed. I guess it's not necessary.
Re^3: Recursive Regex by deMize (Monk) on Mar 12, 2009 at 14:16 UTC
Update: Looking at it now, this is not going to work. If I wish to have all the input on one line, I'd still need the lookahead. For example, the above solution will not handle multiple inline statements such as `[head]Title Text[body]Blah Blah Blah`. In place of the lookahead, I can think of two simple solutions using split: I'd either need to first loop through the string and place an inline character before each `[\w]` pattern, or I can delimit on the pattern itself. So this is what I came up with:

    sub formatCode2 {
        my $code = shift;
        my @arrCode = split(/\[([^\]]*)\]/, $code);
        my $size = @arrCode;

        # print whatever b4 1st delimiter
        if (@arrCode >= 1) {
            $_ = $arrCode[0];
            s/\n/<br \/>/ig;
            print qq(<div class="">$_</div>);
        }

        # print sections
        for (my $cnt = 1; $cnt < @arrCode; $cnt += 2) {
            $_ = $arrCode[$cnt+1];
            s/\n/<br \/>/ig;
            print qq(<div class="$arrCode[$cnt]">$_</div>);
        }
    }
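The split-based routine above relies on a documented feature of `split`: when the pattern contains a capturing group, the captured text is returned between the fields. A small sketch on the one-line input from the post, with variable names chosen for this illustration:

```perl
use strict;
use warnings;

my $line = '[head]Title Text[body]Blah Blah Blah';

# Because the pattern has a capturing group, split returns the captured
# tag names in between the pieces of text it separates.
my @parts = split /\[([^\]]*)\]/, $line;

# Field 0 is whatever preceded the first tag (empty here); after that the
# list alternates: tag name, content, tag name, content, ...
for my $i (0 .. $#parts) {
    print "$i: '$parts[$i]'\n";
}
```

This prints five fields: an empty string, `head`, `Title Text`, `body`, and `Blah Blah Blah`. Note that `split` drops trailing empty fields, so a final tag with no text after it leaves no content element to pair with. A loop that reads `$parts[$i + 1]` may need to default that to an empty string.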
Re^4: Recursive Regex by furry_marmot (Pilgrim) on Mar 19, 2009 at 21:02 UTC