in reply to Parsing using Regex and Lookahead

deMize,

Try this code:

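    my $code = <<EOT;
    [head]
    Head text...

    [body]
    Body text...

    [something else]
    more text..
    EOT

    # Remove every newline run (plus surrounding whitespace) that is followed
    # by more text, so only the final newline survives. Requiring a \n in the
    # match keeps the spaces inside "Head text..." and "[something else]".
    $code =~ s/\s*\n\s*(?=\S)//g;
    formatCode($code);

    sub formatCode {
        my $str = shift;
        $str =~ s{ \[ ([^\]]+) \] ([^\]]*) (?=\[|\n)}
                 {<div class="$1">$2</div>}igx;
        print $str;
    }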

I took your code, mixed in some of the suggestions of others, and made some changes I hope you find useful (since you asked). First, I changed the quoting to a "here doc", which is more like the file you'll probably be reading from in most situations.

Then I slightly changed the newline-removal approach suggested below, using a zero-width positive lookahead assertion (look up "(?=" in perldoc perlre). My changes also accommodate blank lines in the source, which helps with readability.

Finally, in the regex, I used the /x modifier and curly braces to aid readability, and used a negated character class, so you're now looking for "bracket, (not a close-bracket)+, bracket, (not a close-bracket)*, stop before a bracket or a newline".

Using "+" in the first capture and "*" in the second capture enforces an assumption (that you should think about and modify to suit) that brackets ALWAYS have a div name in them, but there might be no text after it. If there must always be text after a div name, use the "+" in both cases.
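To make the difference between the two quantifiers concrete, here is a minimal sketch. The sections() helper and the sample input are made up for illustration, and the text capture uses the simpler class [^\[\]] rather than the lookahead from the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original code): returns [name, text]
# pairs found in $str, using "+" or "*" as the quantifier on the text capture.
sub sections {
    my ($str, $quant) = @_;
    my $re = $quant eq '+'
        ? qr{ \[ ([^\]]+) \] ([^\[\]]+) }x    # text after the name is required
        : qr{ \[ ([^\]]+) \] ([^\[\]]*) }x;   # text after the name is optional
    my @found;
    while ($str =~ /$re/g) {
        push @found, [$1, $2];
    }
    return @found;
}

my $input = '[head]Title[spacer][body]Text';

# With "*", the empty [spacer] section still matches (with empty text).
print scalar(sections($input, '*')), "\n";   # prints 3
# With "+", [spacer] has no text after it, so it is skipped.
print scalar(sections($input, '+')), "\n";   # prints 2
```

So "*" suits the case where an empty div is wanted for formatting, while "+" silently drops such sections.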

Since you are a professed newbie, I'll tell you that this is a common technique when you're learning. Later, you'll find it more practical (especially with large files) to process a stream, rather than manipulate a huge string. So, I'll leave you with this:

    my $code = <<EOT;
    [head]
    Head text...

    [body]
    Body text...

    [something else]
    more text..
    EOT

    $prev_header = 0;                        # takes care of closing divs
    for (split /\n/, $code) {                # No need to remove newlines
        if (/^\s*$/) { next }                # Skip blank lines
        elsif ( /^\[ ([^\]]+) \]/x ) {       # If line is a header...
            print "</div>" if $prev_header;  # close prev div, if there was one...
            print qq(<div class="$1">);      # print current div...
            $prev_header = 1;                # and set current div flag.
        }
        else { print "$1 " }                 # Simply print non-header lines
    }
    print "</div>";  # You always have to close last div, so just do it here

To read from a file:

    open my $FH, '<', 'whatever.txt' or die "Cannot open whatever.txt: $!";
    while (<$FH>) {    # while (not for) reads one line at a time
        chomp;
        ...
    }

Replies are listed 'Best First'.
Recursive Regex: Response
by deMize (Monk) on Mar 11, 2009 at 19:17 UTC

    Nice remarks! I will have to try this when I get home.


    As for the stream, you are right. Also, I wouldn't say I'm a newbie, and I guess "novice" was the wrong word as well; if I had to call my level something, I should have said a former amateur: never reaching expert or monk level, but a little better than a neophyte.

    I've done much of this stuff before, but have since forgotten it. Hence my opening the thread looking for help, and then remembering that the "lookahead" was what I was after.

    The issue here is that I'm going to have to manipulate this in the future. For simplicity I made the delimiters: ([]) and (\n), which may or may not have nested text. So, I used (.*) on purpose because I might want to include something in a blank div for formatting.

    I know I could just write everything in HTML-like syntax, or possibly some made up pseudo-SGML, like Wiki, but that would both take the fun away and increase my typing time. -- Something just interests me about having one label.

    One of my concerns is that I wanted it to be efficient. I think streaming is the best choice - no question with lengthy strings.

    I don't recall what backtracking is, so I'm going to have to evaluate what moritz was talking about with: /^\[ ([^\]]+) \]/x. I vaguely remember an issue like the one he described with ab, but for some reason I thought the non-greedy (?) would take that away.

      Okay so after looking at it, it looks like: ([^\]]+) is basically saying, match anything not a close-bracket. I would use a (*) instead of a (+) for the reason of the empty div discussed above.

      I question whether (?>) might be of use


      Additionally, I don't know how I would stream the text. It's a parameter passed from a webform.
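One possible way to stream text that arrives as a string is to open a filehandle on a reference to that string, which Perl supports through its in-memory files. The sketch below assumes the form value is already in a scalar; the lines_from_string() helper and the sample $code value are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the non-blank lines of a string, reading it
# through an in-memory filehandle instead of split().
sub lines_from_string {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    my @lines;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;        # like chomp, but also copes with CRLF
        next if $line =~ /^\s*$/;    # skip blank lines
        push @lines, $line;
    }
    close $fh;
    return @lines;
}

# $code stands in for the value received from the web form.
my $code = "[head]\nHead text...\n\n[body]\nBody text...\n";
print "$_\n" for lines_from_string($code);
```

The same while loop then works unchanged whether the handle comes from a file or from the form parameter.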

        Sorry not to reply sooner, but I've been busy.

        Backtracking is simply what the regex engine does when its current attempt fails but it still has other alternatives to try. A simple example: if you want to match /(this|that).*(these|those)/, the engine first looks for a 't', then an 'h', etc. If it finds an 'n' after 'thi', then it backtracks to see if it can match 'that'. In this case, though it might not be nice to look at, breaking it out into four regexes (/this.*these/, /this.*those/, etc) can be more efficient than the alternation version, because if one of them fails to find 'this', for example, it simply fails without trying additional matches.

        Anyway, (?>...) is a way to cut off backtracking for hairy regexes. It can make parsing a lot faster.
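A small sketch of what "cutting off backtracking" means in practice. The two helper subs and the test strings are made up for illustration; the (?>...) construct itself is described in perldoc perlre.

```perl
use strict;
use warnings;

# Ordinary quantifier: a* first grabs every "a", then gives one back
# so that the literal "ab" can still match.
sub match_plain  { return $_[0] =~ /^a*ab/     ? 1 : 0 }

# Atomic group: once (?>a*) has grabbed every "a" it never gives any back,
# so no "a" is left for the literal "ab" and the match fails.
sub match_atomic { return $_[0] =~ /^(?>a*)ab/ ? 1 : 0 }

my $str = 'aaab';
print "plain=",  match_plain($str),  "\n";   # plain=1
print "atomic=", match_atomic($str), "\n";   # atomic=0
```

The speed-up comes from exactly this refusal to retry: a failing match gives up sooner. The trade-off is that an atomic group can also reject strings the plain pattern would accept, so it only belongs where the group never needs to give characters back.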

Re^2: Recursive Regex
by deMize (Monk) on Mar 12, 2009 at 03:15 UTC
    Just so everyone knows, some changes needed to be made:

    The updated routine:
        sub formatCode{
            my $code = shift;
            my $prev_header = 0;                         # takes care of closing divs
            for (split /\n/, $code) {                    # No need to remove newlines
                if (/^\s*$/) { next }                    # Skip blank lines
                elsif ( /^(.*?)\[ ([^\]]*) \](.*)/x ) {  # If line is a header...
                    print "$1</div>" if $prev_header;    # close prev div, if there was one...
                    print qq(<div class="$2">$3);        # print current div...
                    $prev_header = 1;                    # and set current div flag.
                }
                else { print "$_<br />" }                # Simply print non-header lines
            }
            print "</div>" if $prev_header;  # You always have to close last div, so just do it here
        }

    Basically $1 needed to be changed to $_. Also, I added matches for before and after the tag, in case there was inline text. Finally, the line break was added in place of a newline character. I'm still curious whether $prev_header should be reset to 0 after the closing div has been printed. I guess it's not necessary.


      Update:
      Looking at it now, this is not going to work.
      If I wish to have all the input on one line, I'd still need the lookahead. For example, the above solution will not handle multiple inline statements: [head]Title Text[body]Blah Blah Blah

      In place of the lookahead, I can think of two simple solutions using split: I'd either need to first loop through the string and place a newline character before each [\w*] pattern, or I can delimit on the pattern itself.

      So this is what I came up with:
          sub formatCode2{
              my $code = shift;
              my @arrCode = split (/\[([^\]]*)\]/, $code);
              my $size = @arrCode;

              # print whatever b4 1st delimiter
              if (@arrCode >= 1) {
                  $_ = $arrCode[0];
                  s/\n/<br \/>/ig;
                  print qq(<div class="">$_</div>);
              }

              # print sections
              for (my $cnt = 1; $cnt < @arrCode; $cnt+=2){
                  $_ = $arrCode[$cnt+1];
                  s/\n/<br \/>/ig;
                  print qq(<div class="$arrCode[$cnt]">$_</div>);
              }
          }

        Maybe I'm missing something, but why do you want the input all on one line? It seems to me that you're creating the problem you're trying to solve when there wasn't a problem in the first place.

        If you put multiple inline statements on separate lines, you wouldn't have the problem. Since this already resembles a .INI format, why not just take it a little further:

            [section]
            head=Title Text
            Blah Blah Blah    <--body text

        Specific items (like head, div, section) are called out in some easily parseable fashion, and plain text defaults to body text. Running it all together as one line just makes a huge parsing problem. But in something like the above example, you don't need split() or lookaheads and the parsing is trivial.
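A minimal sketch of how such a format might be parsed line by line. The parse_sections() helper, the returned data layout, and the <h1> used for the head item are all assumptions for illustration, not part of the suggestion above.

```perl
use strict;
use warnings;

# Hypothetical parser for the .INI-like format sketched above.
# Returns a list of hashrefs: { name => ..., attr => {...}, body => [...] }.
sub parse_sections {
    my ($text) = @_;
    my @sections;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/;                 # skip blank lines
        if ($line =~ /^\[([^\]]+)\]\s*$/) {       # [section] header
            push @sections, { name => $1, attr => {}, body => [] };
        }
        elsif (!@sections) {
            next;                                 # text before any header is ignored here
        }
        elsif ($line =~ /^(\w+)=(.*)$/) {         # key=value item such as head=...
            $sections[-1]{attr}{$1} = $2;
        }
        else {                                    # anything else is body text
            push @{ $sections[-1]{body} }, $line;
        }
    }
    return @sections;
}

my @parsed = parse_sections("[section]\nhead=Title Text\nBlah Blah Blah\n");
for my $s (@parsed) {
    print qq(<div class="$s->{name}">);
    print "<h1>$s->{attr}{head}</h1>" if defined $s->{attr}{head};  # assumed markup
    print join('<br />', @{ $s->{body} });
    print "</div>\n";
}
```

Each line is classified by a single anchored match, so no lookahead or split on delimiters is needed.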

        my 2 cents

        --marmot