
How can I process "lazy" XML like our <code> tags? The best solution would work within the Twig framework, but here is a stand-alone preprocessor that does it.

The concept demo below scans the proto-XML and escapes the special characters (& and <) inside the elements that are supposed to be literal.

I thought about using Parse::RecDescent, or some other parsing technology, but this should be a simple problem. I'm wondering whether this general idea, cascaded REs that share a continuing "pos", can be improved.
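To make "continuing pos" concrete, here is a minimal sketch separate from the demo itself; the sample string and the @found array are made up for illustration. The outer m//g leaves pos() just after a start tag, and the inner m//gc picks up from that same position. The /c modifier is there so that a failed inner match leaves pos() where it was.

```perl
use strict;
use warnings;

my $text = 'a <x>one</x> b <y>two</y>';   # made-up sample input
my @found;

# The outer match leaves pos($text) just after each start tag...
while ($text =~ m/<(\w+)>/g) {
    my $name  = $1;
    my $start = pos($text);
    # ...and the inner match resumes from that same position.
    # /c keeps pos() intact if no end tag is found.
    if ($text =~ m/<\/\Q$name\E>/gc) {
        # $-[0] is the offset where the end tag begins.
        push @found, [$name, substr($text, $start, $-[0] - $start)];
    }
}

print "$_->[0]: $_->[1]\n" for @found;
```

After the inner match succeeds, the outer loop continues from the end of the end tag, so the body of the element is never rescanned for start tags.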

use strict;
use warnings;


sub is_literal ($$)
 {
 my ($name, $attrs)= @_;
 return ($name eq 'listing') || ($name eq 'signature');  # simple demo.
 # change this to analyse $name and $attrs to decide whether to treat this literally.
 }

sub escape_out ($)
 {
 my $passage= shift;
 $passage =~ s/&/&amp;/g;
 $passage =~ s/</&lt;/g;
 return "[[[* $passage *]]]";  # [[[]]] to visibly show that the right "bite" was taken.
 }

sub scan ($)
 {
 my @passages;
 my $line= shift;
 # first pass: note what sections need treatment, without actually modifying the string.
 # modifying the string would mess up the "pos" used by the RE's.
 while ($line =~ m/<\s*(\w+)([^>]*)>/g) {
    # for every start tag...
    my $startpos= pos($line);
    my $name= $1;
    if (is_literal ($name, $2)) {
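       # if targeted, find the matching end tag using a simple pattern (ignoring other stuff).
       # this skips that passage for the continued search of all start tags.
       # /c keeps pos() when no end tag is found; without it a failed match
       # resets pos() and the outer loop would start over from the beginning.
       $line =~ m/<\/\Q$name\E>/gc  or next;
       my $endpos= pos($line);
       unshift @passages, [$startpos, $endpos-(length($name)+3)];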
       }
    }
 # second pass: process the sections noted above, from right-to-left so
 # positions don't change.
 foreach my $range (@passages) {
    my ($start, $end)= @$range;
    my $length= $end-$start;
    substr($line, $start, $length)= escape_out (substr($line, $start, $length));
         # is there an easier way to do that without substr'ing twice?
    }
 print $line;
 }

my $testdata= <<'EOF';
   <method name="mainloop">
      <signature virtual="1">int mainloop (ratwin::message::MSG&)</signature>
      <P>This is the canonical logic of the message pump.  It looks approximately 
      like this:</P>
      <listing>
          use & and <things> in here.
          MSG msg;
          while ( GetMessage(msg) ) {
             if (msg.hwnd == 0)  thread_message (msg);
             else {
                if (!pre_translate (msg)) {  // check IsDialog,  TranslateAccelerator
                   if (!translate_key_even(msg))  // Win32 TranslateMessage
                      DispatchMessage(msg); 
                   }
                }
             }
          return (msg.wParam);
      </listing>
      <P>Override this if you need to customize this beyond the point provided 
      for by the virtual functions provided for the individual steps.</P>
   </method>
EOF

scan ($testdata);
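For comparison, one alternative that avoids position bookkeeping altogether is a single s///gse substitution. This is only a sketch of a different technique, not the two-pass "pos" approach above. The names escape_text, escape_literals and $literal are hypothetical, and it assumes the literal element names are known up front instead of being decided by is_literal from the attributes. It also assumes literal elements of the same name are never nested.

```perl
use strict;
use warnings;

# Same escaping as escape_out, minus the [[[* *]]] debug markers.
sub escape_text {
    my $passage = shift;
    $passage =~ s/&/&amp;/g;
    $passage =~ s/</&lt;/g;
    return $passage;
}

# Assumption: the set of literal element names is fixed.
my $literal = qr/listing|signature/;

sub escape_literals {
    my $text = shift;
    # $1 = start tag, $2 = element name, $3 = body, $4 = end tag.
    # The captures are copied to lexicals before escape_text runs its own regexes.
    $text =~ s{(<\s*($literal)\b[^>]*>)(.*?)(</\2>)}
              { my ($open, $body, $close) = ($1, $3, $4);
                $open . escape_text($body) . $close }gse;
    return $text;
}

print escape_literals('<P>a</P><listing>x & <y></listing>'), "\n";
```

Because the replacement is computed per match, there is no need to work right-to-left or to call substr at all; the cost is that the decision about what is literal has to be expressible in the pattern.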

In reply to CDATA-like "literal" tags in XML-like data by John M. Dlugosz
