deMize has asked for the wisdom of the Perl Monks concerning the following question:

An attempt:
This is an example to explain what I'm trying to do:
    my $code = qq(
    [head]
    Head text...
    [body]
    Body text...
    [something else]
    more text..
    ); #example of text

    $code =~ s/\n//g;   #remove all the newlines
    $code .= "\n";      #add one to the end
    formatCode($code);

    sub formatCode{
        my $str = $_[0];
        $str =~ s#\[(.*?)\](.*?)(\[.*?\]|\n)#<div class="$1">$2</div>#ig;
        print $str;
    }
Feel free to optimize the RegEx too.. I'm a novice and can use wonderful shortcuts :)

So I wanna put each section in a DIV. It knows when the section ends because it hits a new section or a line break.

I tried sticking a $3 at the end of the replace side, but it didn't work. I don't know how to do this recursively without doing a loop until a match can't be found.

Any help is appreciated, thanks.


Solution:
I forgot how to do the lookahead:
$str =~ s#\[(.*?)\](.*?)(?=\[.*?\]|\n)#<div class="$1">$2</div>#ig;
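For anyone comparing the two versions, here is a minimal sketch of why the lookahead helps. The helper names consume_version and lookahead_version are made up for illustration, and the input is a shortened form of the example string with the newlines already collapsed.

```perl
use strict;
use warnings;

# Original attempt: the third group consumes the next "[...]" tag,
# so the section that follows can no longer match.
sub consume_version {
    my ($str) = @_;
    $str =~ s#\[(.*?)\](.*?)(\[.*?\]|\n)#<div class="$1">$2</div>#g;
    return $str;
}

# Lookahead version: (?=...) only checks that a tag or newline follows,
# without consuming it, so every section gets its own div.
sub lookahead_version {
    my ($str) = @_;
    $str =~ s#\[(.*?)\](.*?)(?=\[.*?\]|\n)#<div class="$1">$2</div>#g;
    return $str;
}

my $input = "[head]Head text[body]Body text[foot]Foot text\n";
print consume_version($input), "\n";
print lookahead_version($input);
```

In the first version the [body] tag is swallowed by the match for [head], so "Body text" ends up outside any div; in the second, all three sections are wrapped.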

Replies are listed 'Best First'.
Re: Recursive Regex
by moritz (Cardinal) on Mar 11, 2009 at 07:52 UTC
    Just a small hint: when you have something of the form ^\[ .*? \] the regex engine can backtrack over the ] char, which might not be what you want. For example it can match the whole string of [a][b] if something after that part of the regex causes backtracking.

    Instead you can use this regex: ^\[ [^\]]* \]

    Which always matches an opening bracket, other characters and then the closing bracket, so in the example above it will never match more than the [a]

    (whitespaces added for clarity, you need the /x modifier if you want to keep them).
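A small sketch of the difference, assuming a trailing $ as the "something after that part of the regex" that forces the backtracking. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $text = "[a][b]";

# Non-greedy .*? prefers a short match, but the trailing $ forces the
# engine to backtrack and extend the capture across the first "]".
my ($lazy) = $text =~ /^\[(.*?)\]$/;

# The negated class can never cross a "]", so the same anchored
# pattern fails instead of over-matching.
my ($negated) = $text =~ /^\[([^\]]*)\]$/;

print "lazy:    ", (defined $lazy    ? $lazy    : "no match"), "\n";
print "negated: ", (defined $negated ? $negated : "no match"), "\n";
```

The lazy version captures "a][b", while the negated-class version reports no match.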

Re: Recursive Regex
by johngg (Canon) on Mar 11, 2009 at 09:46 UTC

    While you are on the subject of lookaheads, this

    $code =~ s/\n//g;   #remove all the newlines
    $code .= "\n";      #add one to the end

    could be replaced by this

    $code =~ s/\n(?=.)//g # remove all but the last newline

    I hope this is of interest.

    Cheers,

    JohnGG

      Your replacement isn't equivalent to the code you claim it can replace. First of all, your code will not remove a newline followed by a newline - you'd need (?=(?s:.)) for that, or the /s modifier. Second, the original code will always leave $code ending with a newline, regardless of whether it ended with one originally. In your replacement, there will only be a trailing newline in $code if there was one originally.
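Both points can be checked with a short sketch. The three helper names below are invented for the comparison and are not from the thread.

```perl
use strict;
use warnings;

# Original two-step approach: strip every newline, then append one.
sub strip_then_append {
    my ($s) = @_;
    $s =~ s/\n//g;
    return $s . "\n";
}

# Without /s, "." does not match a newline, so a newline that is
# followed by another newline survives.
sub strip_without_s {
    my ($s) = @_;
    $s =~ s/\n(?=.)//g;
    return $s;
}

# With /s, "." matches any character, so only a newline at the very
# end of the string is kept.
sub strip_with_s {
    my ($s) = @_;
    $s =~ s/\n(?=.)//gs;
    return $s;
}

printf "without /s keeps an inner newline: %s\n",
    strip_without_s("a\n\nb\n") eq "a\nb\n" ? "yes" : "no";
printf "with /s matches the two-step code on 'a\\nb': %s\n",
    strip_with_s("a\nb") eq strip_then_append("a\nb") ? "yes" : "no";
```

The second line prints "no": when the input has no trailing newline, the lookahead version does not add one.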

        Good catch re. the s modifier, I missed that. Thanks for the correction.

        With regard to your second point, from the way the OP initialised $code I don't think always ending with a newline was the requirement s/he was addressing. For the more general case you are correct.

        Cheers,

        JohnGG

        So I'm thankful for the help!

        First off: deMize « he »

        Second, I'm not sure what I did was the best way of going about things. What I'm doing is just building a CMS insertion page with as little markup as possible, where each section is contained in its own div.

        Response:
        The first problem was getting each into their own div, which the lookahead helped. To accomplish this, it just ends the current div when it hits a new section (may be a subsection in the future).

        The reason for removing the line breaks was because I was using that as a delimiter for the last section --- there is probably a better way of doing that with determining the end of the string in the RegEx (maybe $), but I need to replace those line breaks with HTML breaks anyhow.

        The problem:
        This means that it is going to be linear with no sub-divs. I might have to rethink that for later, because I might want to have something like this later:
            [section]
            [top]Top Data
            [middle]Mid Data
            [bottom]Bottom Data
            [section]
            [top]Top Data
            [bottom]Bottom Data

        Should result in:
            <div class="section">
              <div class="top">Top Data</div>
              <div class="middle">Mid Data</div>
              <div class="bottom">Bottom Data</div>
            </div>
            <div class="section">
              <div class="top">Top Data</div>
              <div class="bottom">Bottom Data</div>
            </div>

        What I plan to do is store either the sub sections in an array or the sections in an array. I could probably use help with a better algorithm.
        The whole purpose of this was so that I could quickly type the data into one input box, without building a whole intricate interface (that can come later).
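One possible direction, offered only as a sketch: it assumes a fixed set of tag names (here just "section", held in a hypothetical %is_container hash) that open a container, with every other tag becoming a child div inside the current container. The thread does not specify how nesting should be detected, so that rule and the function name format_nested are assumptions.

```perl
use strict;
use warnings;

# Assumption for illustration: these tag names open a container div;
# every other tag becomes a child div inside the current container.
my %is_container = (section => 1);

sub format_nested {
    my ($code) = @_;
    # The capturing group makes split return the tag names too.
    my @parts = split /\[([^\]]*)\]/, $code;
    shift @parts;                        # drop text before the first tag
    my $html = '';
    my $open = 0;                        # is a container currently open?
    while (@parts) {
        my $name = shift @parts;
        my $text = @parts ? shift @parts : '';
        $text =~ s/^\s+|\s+$//g;         # trim surrounding whitespace
        if ($is_container{$name}) {
            $html .= "</div>" if $open;  # close the previous container
            $html .= qq(<div class="$name">$text);
            $open = 1;
        }
        else {
            $html .= qq(<div class="$name">$text</div>);
        }
    }
    $html .= "</div>" if $open;          # close the last container
    return $html;
}

print format_nested("[section] [top]Top Data [bottom]Bottom Data"), "\n";
```

This handles one level of nesting only and does no HTML escaping; deeper nesting would need an explicit stack or closing tags in the input.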




Re: Recursive Regex
by furry_marmot (Pilgrim) on Mar 11, 2009 at 18:12 UTC
    deMize,

    Try this code:

        my $code = <<EOT;
        [head]
        Head text...
        [body]
        Body text...
        [something else]
        more text..
        EOT

        # remove each newline (and any blank lines or indentation after it)
        # that is followed by more text, so only the last newline remains
        $code =~ s/\n\s*(?=\S)//g;
        formatCode($code);

        sub formatCode {
            my $str = shift;
            $str =~ s{ \[ ([^\]]+) \] ([^\]]*) (?=\[|\n)}
                     {<div class="$1">$2</div>}igx;
            print $str;
        }

    I took your code, mixed in some of the suggestions of others, and made some changes I hope you find useful (since you asked). First, I changed the quoting to a "here doc", which is more like the file you'll probably be reading from in most situations.

    Then I made a slight change to the way suggested below of removing the newlines, using a zero-width positive lookahead assertion (look up "?=" in perldoc perlre). The changes I made also accommodate blank lines in the source, which will help with readability.

    Finally, in the regex, I used the /x modifier and curly braces to aid in the readability, and used a class, so you're now looking for "bracket, (not a bracket)+, bracket, (not a bracket)*, stop before a bracket".

    Using "+" in the first capture and "*" in the second capture enforces an assumption (that you should think about and modify to suit) that brackets ALWAYS have a div name in them, but there might be no text after it. If there must always be text after a div name, use the "+" in both cases.

    Since you are a professed newbie, I'll tell you that this is a common technique when you're learning. Later, you'll find it more practical (especially with large files) to process a stream, rather than manipulate a huge string. So, I'll leave you with this:

        my $code = <<EOT;
        [head]
        Head text...
        [body]
        Body text...
        [something else]
        more text..
        EOT

        $prev_header = 0;                         # takes care of closing divs
        for (split /\n/, $code) {                 # No need to remove newlines
            if (/^\s*$/) { next }                 # Skip blank lines
            elsif ( /^\[ ([^\]]+) \]/x ) {        # If line is a header...
                print "</div>" if $prev_header;   # close prev div, if there was one...
                print qq(<div class="$1">);       # print current div...
                $prev_header = 1;                 # and set current div flag.
            }
            else { print "$1 " }                  # Simply print non-header lines
        }
        print "</div>";   # You always have to close last div, so just do it here

    To read from a file:

        open my $FH, '<', 'whatever.txt' or die "Cannot open whatever.txt: $!";
        while (<$FH>) {    # reads one line at a time
            chomp;
            ...
        }

      Nice remarks! I will have to try this when I get home.


      As for the stream, you are right. Also, I wouldn't say I'm a newbie and I guess novice was incorrect to say as well; if I had to call my level something, I guess I should have said a previous amateur, never reaching the expert or monkism, but a little better than a neophyte.

      I've done much of this stuff before, but have since forgotten it. Hence me opening the thread looking for help, but then remembering the "lookahead" was what I was after.

      The issue here is that I'm going to have to manipulate this in the future. For simplicity I made the delimiters: ([]) and (\n), which may or may not have nested text. So, I used (.*) on purpose because I might want to include something in a blank div for formatting.

      I know I could just write everything in HTML-like syntax, or possibly some made up pseudo-SGML, like Wiki, but that would both take the fun away and increase my typing time. -- Something just interests me about having one label.

      One of my concerns is that I wanted it to be efficient. I think streaming is the best choice - no question with lengthy strings.

      I don't recall what backtracking is, so I'm going to have to evaluate what moritz was talking about with: /^\[ ([^\]]+) \]/x. I vaguely remember an issue like the one he described with [a][b], but for some reason I thought the non-greedy (?) would take that away.

        Okay so after looking at it, it looks like: ([^\]]+) is basically saying, match anything not a close-bracket. I would use a (*) instead of a (+) for the reason of the empty div discussed above.

        I question whether (?>) might be of use


        Additionally, I don't know how I would stream the text. It's a parameter passed from a webform.
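One way to get line-by-line reading from a string is Perl's in-memory filehandle, opened on a reference to the scalar. The sketch below assumes the form value has already been placed in a plain scalar (how it is fetched is not shown), and format_stream is a made-up name. Like the loop above, it ignores any text on the same line as a header.

```perl
use strict;
use warnings;

# Read a string one line at a time through an in-memory filehandle.
# $code stands in for the form parameter.
sub format_stream {
    my ($code) = @_;
    open my $fh, '<', \$code or die "Cannot open in-memory handle: $!";
    my $html = '';
    my $open = 0;                         # is a div currently open?
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;         # skip blank lines
        if ($line =~ /^\[([^\]]*)\]/) {   # header line
            $html .= "</div>" if $open;   # close the previous div
            $html .= qq(<div class="$1">);
            $open = 1;
        }
        else {
            $html .= "$line ";            # plain text line
        }
    }
    close $fh;
    $html .= "</div>" if $open;           # close the last div
    return $html;
}

print format_stream("[head]\nHead text...\n[body]\nBody text...\n"), "\n";
```

Since the whole string is already in memory, this mostly keeps the code in the same shape as a file-reading loop; it does not save memory by itself.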
      Just so everyone knows, some changes needed to be made:

      The updated routine:
          sub formatCode{
              my $code = shift;
              my $prev_header = 0;                          # takes care of closing divs
              for (split /\n/, $code) {                     # No need to remove newlines
                  if (/^\s*$/) { next }                     # Skip blank lines
                  elsif ( /^(.*?)\[ ([^\]]*) \](.*)/x ) {   # If line is a header...
                      print "$1</div>" if $prev_header;     # close prev div, if there was one...
                      print qq(<div class="$2">$3);         # print current div...
                      $prev_header = 1;                     # and set current div flag.
                  }
                  else { print "$_<br />" }                 # Simply print non-header lines
              }
              print "</div>" if $prev_header;   # You always have to close last div, so just do it here
          }

      Basically $1 needed to be changed to $_. Also, I added matches for before and after the tag, in case there was inline text. Finally, an HTML line break was added in place of the newline character. I'm still curious whether $prev_header should be reset to 0 after the closing div has been printed. I guess it's not necessary.


        Update:
        Looking at it now, this is not going to work.
        If I wish to have all the input on one line, I'd still need the lookahead. For example, the above solution will not handle multiple inline sections: [head]Title Text[body]Blah Blah Blah

        In place of the lookahead, I can think of two simple solutions using split: I'd either need to first loop through the string and place a newline character before each [\w*] pattern, or I can delimit on the pattern itself.

        So this is what I came up with:
            sub formatCode2{
                my $code = shift;
                my @arrCode = split (/\[([^\]]*)\]/, $code);
                my $size = @arrCode;

                # print whatever b4 1st delimiter
                if (@arrCode >= 1) {
                    $_ = $arrCode[0];
                    s/\n/<br \/>/ig;
                    print qq(<div class="">$_</div>);
                }

                # print sections
                for (my $cnt = 1; $cnt < @arrCode; $cnt+=2){
                    $_ = $arrCode[$cnt+1];
                    s/\n/<br \/>/ig;
                    print qq(<div class="$arrCode[$cnt]">$_</div>);
                }
            }
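A variation on the same split idea, as a sketch rather than a drop-in replacement: it returns the HTML as a string instead of printing, avoids assigning to the global $_, and treats a tag with no text after it as an empty div. Unlike formatCode2, it skips the leading div when there is nothing before the first tag; that choice and the name format_split are assumptions.

```perl
use strict;
use warnings;

sub format_split {
    my ($code) = @_;
    # The capturing group makes split return the tag names as well:
    # (text before first tag, name1, text1, name2, text2, ...)
    my ($lead, @pairs) = split /\[([^\]]*)\]/, $code;
    my $html = '';
    if (defined $lead && length $lead) {
        (my $t = $lead) =~ s/\n/<br \/>/g;
        $html .= qq(<div class="">$t</div>);
    }
    while (@pairs) {
        my $name = shift @pairs;
        my $text = @pairs ? shift @pairs : '';   # tag with no text after it
        $text =~ s/\n/<br \/>/g;
        $html .= qq(<div class="$name">$text</div>);
    }
    return $html;
}

print format_split("[head]Title Text[body]Blah Blah Blah"), "\n";
```

split drops trailing empty fields, which is why the last tag may arrive without a text element and needs the explicit empty-string default.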