tshabet has asked for the wisdom of the Perl Monks concerning the following question:

Hi All, So I'm using the excellent CPAN module Text::Balanced, which is the coolest and most useful module ever. Anyway, I'm using it to parse out balanced blocks of code, such that
{foo is awesome and so is {bar}}
is a match (since the braces are balanced) but
{foo is not {balanced}
is not. Pretty elementary, yes? OK, so here's my problem. I'm using this module to convert some code written in Curl into XML. Curl uses || to denote comments, like this:
{center foo} ||this is comment
My problem is that comments often contain braces, and I of course don't want to end up with a situation where
|| an open brace looks like this { {beginning of a block {some function} end of a block}
does not match. An extra { in a comment leaves everything after it unmatched to the end of the input, since a matching } won't be found unless it's a random } in another comment. So my current (not so) brilliant thinking is that I will use regexes like these:
$text =~ s/\|\|(.*?)\n/\<\!\-\-$1\-\-\>\n/g; #turn line comment form into XML comment form
$text =~ s/<\!\-\-(.*?)\}(.*?)\-\->/<\!\-\-$1 endbrace $2\-\->/gxs; #escape the end braces in comment so they needn't be balanced
$text =~ s/<\!\-\-(.*?)\{(.*?)\-\->/<\!\-\-$1 openbrace $2\-\->/gxs; #ditto for open braces
So far so good, right? OK, so I run these regexes, then run the Text::Balanced routine, then after that is done I search for "endbrace" and "openbrace" and replace them appropriately. So this makes sense in my head, but it does not seem to work in actuality. I was hoping that, for example,
||{paragraph The union of zero or more {glossary citation="type", types }
|| may be denoted using {ONE-OF }. For example, the
|| {glossary type expression } {ctext {one-of int float } } {glossary
|| citation="evaluate", evaluates } to a non- {glossary
|| representational type } that can be used to {glossary declare } a {glossary
|| variable } that can hold either an {INT } or a {FLOAT }. }
Would simply become the same block of code with <!-- in place of || and a --> at the end of each line. Actually, the change of comment indicators seems fine, but identifying each brace seems to be screwing up. Running the above code through my program I get
<!-- { paragraph The union of zero or more <glossary citation="type", types } -->
<!-- may be denoted using { ONE-OF } . For example, the-->
<!-- { glossary type expression } <ctext> <one-of> int float </one-of> </ctext> <glossary-->>
<!-- citation="evaluate", evaluates } to a non- { glossary-->
<!-- representational type } that can be used to { glossary declare </glossary--> a <glossary-->>
<!-- variable } that can hold either an { INT </glossary--> or a <FLOAT> </FLOAT>. > -->
The <> instead of {} is due to Text::Balanced recognizing balanced braces, which indicates that my regexes are not working on all occurrences. Specifically, they seem to match once per line of comment. Am I making some dumb mistake in my regexes? Is there a better way to get Text::Balanced to ignore braces in || comments? Any suggestions/pointers much appreciated. Thanks!

Replies are listed 'Best First'.
Re: Problem with skipping comment
by Hofmator (Curate) on Aug 03, 2001 at 15:15 UTC

    I don't know the Text::Balanced module, so I can't tell you if there is a shortcut somewhere. But I can tell you why your regexes only work once.

    First I would suggest leaving out all the unnecessary escapes. This makes your regex much more readable and less toothpicky:

        $text =~ s/<!--(.*?)}(.*?)-->/<!--$1 endbrace $2-->/gxs;

    This matches a whole comment, and within the comment only the first closing brace is matched and replaced. Due to the /g modifier the regex continues searching, but it resumes after the end of the previous match, not inside it. So the same comment cannot match twice, which is what would be required to replace multiple closing braces. The same goes for multiple opening braces.
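
    The once-per-comment behaviour can be reproduced on a small test string. This is only a sketch: the input is a made-up comment, not taken from the original file.

```perl
use strict;
use warnings;

# Hypothetical input: one XML-style comment holding three closing braces.
my $text = "<!-- a } b } c } -->";

# A single pass with /g: the match consumes the whole comment, so only
# the first closing brace inside it is replaced.
(my $once = $text) =~ s/<!--(.*?)}(.*?)-->/<!--$1 endbrace $2-->/gs;

# Count the closing braces that survived the pass.
my $left = () = $once =~ /}/g;

print "$once\n";
print "closing braces left: $left\n";
```

    Two of the three closing braces survive, because after the first match the search resumes behind the closing --> of that comment.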

    You could fix that behaviour with the following construct:

        1 while s///;

    This repeats the whole substitution until no more substitutions can be performed. It does the trick, but can be quite inefficient on large files, as the whole text has to be searched multiple times. Imagine e.g. a last line like this:

        || }}}}}}}}}}
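
    A minimal sketch of the repeat-until-done idiom, using the same kind of made-up test comment. The pass counter is added only for illustration; it shows that the text is rescanned once per brace in the fullest comment.

```perl
use strict;
use warnings;

# Hypothetical input: one comment with three closing braces.
my $text = "<!-- a } b } c } -->";

# Repeat the substitution until it no longer matches anything.
# Each successful pass replaces one more brace per comment.
my $passes = 0;
$passes++ while $text =~ s/<!--(.*?)}(.*?)-->/<!--$1 endbrace $2-->/gs;

print "$text\n";
print "successful passes: $passes\n";
```

    Every pass scans the entire text again, which is where the cost on large files comes from.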

    That's why I would go a different way for dealing with the comments - line by line simplifies things

        foreach (@lines) {
            chomp;
            if (/^\s*\|\|/) {        # is a comment
                s/}/endbrace/g;
                s/{/openbrace/g;
                s/^\s*\|\|/<!--/;    # the pipes must be escaped; a bare | means alternation
                $_ .= "-->";
            }
        }

    The backsubstitution works similarly if you work on lines again
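
    A rough sketch of the full round trip, following the same line-by-line idea. The helper names mask_comment and unmask_comment and the sample lines are made up for illustration, and the Text::Balanced step is only marked by a placeholder comment. It assumes each comment ends up on its own line and that the words endbrace and openbrace never occur in a real comment.

```perl
use strict;
use warnings;

# Forward step: hide braces inside || comment lines and turn the line
# into an XML comment. (mask_comment is a hypothetical helper name.)
sub mask_comment {
    my ($line) = @_;
    if ($line =~ /^\s*\|\|/) {
        $line =~ s/\}/endbrace/g;
        $line =~ s/\{/openbrace/g;
        $line =~ s/^\s*\|\|/<!--/;
        $line .= "-->";
    }
    return $line;
}

# Reverse step: put the braces back, but only inside comment lines.
sub unmask_comment {
    my ($line) = @_;
    if ($line =~ /^<!--.*-->$/) {
        $line =~ s/endbrace/}/g;
        $line =~ s/openbrace/{/g;
    }
    return $line;
}

my @lines  = ('{center foo}', '|| stray { and {nested} here');
my @masked = map { mask_comment($_) } @lines;

# ... the brace-matching step would run on the masked text here ...

my @restored = map { unmask_comment($_) } @masked;
print "$_\n" for @masked, @restored;
```

    Code lines pass through untouched in both directions, so only comment text is ever rewritten.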

    Remark: I hope you don't have to deal with comment lines like this || min-->max++; because then things get tricky ...

    -- Hofmator