Lana has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I am working on parsing text and making required substitutions inside it. For example, I have a template text and I am using curly braces inside it in places where text may vary depending on input data:

text text {scope 4 text {scope 2 text {scope 1 text} scope 2 text} scope 4 text {scope 3 text} scope 4 text } text text

The question is: how do I process the scopes and subscopes in the order I numbered them in the example sentence, 1-2-3-4? That is, I want to start with the innermost scope and work outward to the outermost one.

What is the best way to do that?

Thanks! Lana :)

Re: Text parsing. Processing scopes and subscopes.
by 1nickt (Canon) on Aug 25, 2015 at 22:50 UTC

    See perlretut and the perlfaq6 entry "Can I use Perl regular expressions to match balanced text?".
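    As a rough illustration of the balanced-text idea, here is a minimal sketch using a recursive pattern (Perl 5.10 or later). The helper name outer_scopes is made up for this example. Note that it only finds the outermost balanced blocks; it does not by itself give the inner-to-outer processing order asked about.

```perl
use strict;
use warnings;

# Recursive pattern (Perl 5.10+): a scope is "{", then any mix of
# non-brace text and nested scopes, then "}".
my $balanced = qr/
    (?<scope>
        \{
        (?: [^{}]++ | (?&scope) )*
        \}
    )
/x;

# Hypothetical helper: return every outermost balanced {...} block.
sub outer_scopes {
    my ($text) = @_;
    my @found;
    while ( $text =~ /$balanced/g ) {
        push @found, $+{scope};
    }
    return @found;
}

print "$_\n" for outer_scopes('a {b {c} d} e {f}');
```

    Unbalanced braces are simply skipped by this pattern, so input that may be malformed would need a separate check.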

    But first, think about what you are doing. What you seek to do is not simple. Unless this is a homework assignment, I would reconsider the problem and approach it differently. If you are parsing a known format such as XML or HTML then you should use an existing module. If you are parsing a file you created then you should create it with a tree structure instead.

    The way forward always starts with a minimal test.

      Agreed. Unbounded nested scoping requires a state machine with a stack. Every time a new scope is encountered, the current state of the machine is pushed onto the stack, and when the scope ends, the saved state is popped off.
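      The stack idea can be sketched roughly as follows. This is only one possible shape, assuming the "state" to save is the offset of each opening brace; the helper name scopes_in_closing_order is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of every {...} scope in the
# order the scopes close, so inner scopes always precede their parents.
sub scopes_in_closing_order {
    my ($text) = @_;
    my @stack;    # offsets of "{" characters not yet closed
    my @done;     # scope contents, in closing order

    while ( $text =~ /([{}])/g ) {
        my $at = pos($text) - 1;    # offset of the brace just matched
        if ( $1 eq '{' ) {
            push @stack, $at;
        }
        else {
            die "unbalanced '}' at offset $at\n" unless @stack;
            my $open = pop @stack;
            push @done, substr( $text, $open + 1, $at - $open - 1 );
        }
    }
    die "unclosed '{' at offset $stack[-1]\n" if @stack;
    return @done;
}

print "$_\n" for scopes_in_closing_order('a {4 {2 {1} 2} 4 {3} 4} b');
```

      On the shortened sample in the last line, the scopes come back in the 1-2-3-4 order from the question, because a scope is only reported when its closing brace is reached.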

Re: Text parsing. Processing scopes and subscopes.
by BrowserUk (Patriarch) on Aug 25, 2015 at 23:41 UTC

    Your spec is pretty minimal, but here's one way to achieve your goal:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];

    my $text = do{ local $/; <DATA>; };

    my( $n, @scopes ) = 1;
    1 while $text =~ s[ ( \{ [^{}]+ \} ) ]{
        push @scopes, $1; '_' . $n++ . '_';
    }gex;

    s[scope (\d+) text][processed text $1]g for @scopes;

    $text =~ s[_${ \( $_+1 ) }_]{ $scopes[ $_ ] }eg for reverse 0 .. $#scopes;

    print $text;

    __DATA__
    text text {scope 4 text {scope 2 text {scope 1 text} scope 2 text} scope 4 text {scope 3 text} scope 4 text } text text

    Output:

    C:\test>1139932
    text text {processed text 4 {processed text 2 {processed text 1} processed text 2} processed text 4 {processed text 3} processed text 4 } text text

Re: Text parsing. Processing scopes and subscopes.
by Anonymous Monk on Aug 26, 2015 at 04:38 UTC

    Processed how?

    #!/usr/bin/perl -l
    # http://perlmonks.org/?node_id=1139932
    use strict;
    use warnings;

    my $in = 'text text {scope 4 text {scope 2 text {scope 1 text} scope 2 text} scope 4 text {scope 3 text} scope 4 text } text text';

    print $_ = $in;
    print while s/{[^{}]*}/PROCESSED/;
Re: Text parsing. Processing scopes and subscopes.
by graff (Chancellor) on Aug 27, 2015 at 02:04 UTC
    Something about the statement of the task seems very odd to me. One way (I think the "literal" way) to interpret your description would yield a sequence like this:
    0. text {scope 4 {scope 2 {scope 1} scope 2} scope 4 {scope 3} scope 4} text
    1. text {scope 4 {scope 2 -change1- scope 2} scope 4 {scope 3} scope 4} text
    2. text {scope 4 - - - -c h a n g e 2- - - - scope 4 {scope 3} scope 4} text
    3. text {scope 4 - - - -c h a n g e 2- - - - scope 4 -change3- scope 4} text
    4. text - - - - - - - - - - - - -c h a n g e 4- - - - - - - - - - - - - text
    So the question would be: why even bother with the initial nested levels, since later changes will obliterate them? If that's what is supposed to happen, it would make more sense to identify the outer-most bracketing, and apply only that single substitution. Perhaps you intended to describe something different?

    (Update: I suppose that if there were interactions from one step to the next - e.g. if a substitution at step 1 either creates or eliminates a condition that affects what happens in a later stage - then it becomes a more complicated business, posing greater challenges for maintenance.)
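    To make that interaction concrete, here is a small sketch in which each scope's replacement is computed from its content, so an outer scope sees the results of its inner scopes. The helper name expand_scopes and the length-reporting callback are purely illustrative, not anything from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: replace scopes innermost-first, passing each
# scope's content (with inner results already substituted) to $process.
# $process must return text without braces, or the loop may not end.
sub expand_scopes {
    my ( $text, $process ) = @_;
    1 while $text =~ s/\{([^{}]*)\}/$process->($1)/e;
    return $text;
}

# Illustrative processing step: wrap the content and report its length.
my $out = expand_scopes( 'a {x {yy} x} b',
    sub { my ($inner) = @_; '<' . $inner . ':' . length($inner) . '>' } );
print "$out\n";    # prints "a <x <yy:2> x:10> b"
```

    The outer length of 10 includes the text produced for the inner scope, which is the kind of step-to-step dependency that makes the order of processing matter.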

Re: Text parsing. Processing scopes and subscopes.
by nuance (Hermit) on Aug 26, 2015 at 13:35 UTC

    The best advice already given is to use a template module. If you decide not to, I'd use substr rather than regular expressions.

    You can use index to find closing braces. The first one you find will be the close of an innermost scope. Then pass that position to rindex to find the opening brace of that scope. You can extract the data with substr.

    You can process the extract and then use substr to replace it and the braces in the original data.

    Instead of processing the extract and replacing it, this snippet uses substr to replace the braces. This means you won't continually work on the same scope. It prints the bits it extracted as $extract so you can see it works on scopes in the correct order.

    You need to make sure that whatever you substitute in doesn't have braces.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $data = <DATA>;
    chomp $data;

    while () {
        my ($rpos, $lpos) = (0, 0);
        $rpos = index($data, "}", $rpos);
        last if $rpos < 0;
        $lpos = rindex($data, "{", $rpos);
        last if $lpos < 0;

        # $lpos + 1 so we don't extract the {
        # -1 at the end to exclude the }
        my $extract = substr($data, $lpos + 1, $rpos - $lpos - 1);

        # get rid of these braces.
        substr($data, $lpos, 1) = "<";
        substr($data, $rpos, 1) = ">";

        # if you wanted to replace the entire section including braces
        # substr($data, $lpos, $rpos - $lpos + 1) = "<>";

        print "|${extract}|\n";
    }
    print "|${data}|\n";

    __DATA__
    text text {scope 4 text {scope 2 text {scope 1 text} scope 2 text} scope 4 text {scope 3 text} scope 4 text } text text
    Output:

    |scope 1 text|
    |scope 2 text <scope 1 text> scope 2 text|
    |scope 3 text|
    |scope 4 text <scope 2 text <scope 1 text> scope 2 text> scope 4 text <scope 3 text> scope 4 text |
    |text text <scope 4 text <scope 2 text <scope 1 text> scope 2 text> scope 4 text <scope 3 text> scope 4 text > text text|
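    For the other variant described above, where the extract is processed and then replaces the whole section including its braces, a minimal sketch might look like this. The helper name replace_scopes and the uppercasing callback are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: find the first "}", then the nearest "{" before it,
# and replace the whole {...} section with the result of $process.
# $process must not return braces, or the same text is scanned again.
sub replace_scopes {
    my ( $data, $process ) = @_;
    while ( ( my $rpos = index( $data, '}' ) ) >= 0 ) {
        my $lpos = rindex( $data, '{', $rpos );
        last if $lpos < 0;    # stray "}" with no opener: stop
        my $extract = substr( $data, $lpos + 1, $rpos - $lpos - 1 );
        # 4-argument substr replaces the section, braces included
        substr( $data, $lpos, $rpos - $lpos + 1, $process->($extract) );
    }
    return $data;
}

print replace_scopes( 'a {b {c} b} d', sub { uc $_[0] } ), "\n";
```

    Because each processed section no longer contains braces, the next search for "}" always lands on a scope that has not been handled yet.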

    Nuance

Re: Text parsing. Processing scopes and subscopes.
by locked_user sundialsvc4 (Abbot) on Aug 26, 2015 at 04:19 UTC

    In addition, may I suggest that you consider using an existing templating tool, such as Template::Toolkit? Although often used for creating HTML pages, it can in fact be used for anything. Although you can “roll your own” solution here, maybe this is an all-around better way to do it. Much more bang for your buck, and nothing to create.