If you want to know more about this parser, this note includes some details. The project can be found at https://github.com/jdlugosz/Text--Creole. It will be a library for parsing Wiki Creole and producing XHTML or other output.

The "inline formatting" (as opposed to block-level constructs) is what this grammar is for. It has to handle things like //italic// and **bold**, and understand that '//' might be part of an implicit link (URL) and not an italic open/close indicator. Similarly, ~ is used for escape and also can be in front of something that looks like a URL to prevent it from being auto-link-ified.
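To make the // ambiguity concrete, here is a minimal sketch. The patterns are simplified stand-ins (assumptions for illustration, not the module's actual rules): a naive italic pattern stops at the // inside a URL, while a body that tries a link alternative before falling back to a single character steps over the URL whole.

```perl
use strict;
use warnings;

# Simplified stand-ins for the real rules (assumptions, not the module's patterns).
my $toy_link     = qr{ (?: https? | ftp ) :// \S+ }x;
my $naive_italic = qr{ // (?<body> .*? ) // }xs;
my $aware_italic = qr{ // (?<body> (?: $toy_link | . )*? ) // }xs;

# Return the captured italic body, or undef when the pattern does not match.
sub italic_body {
    my ($regex, $text) = @_;
    return $text =~ $regex ? $+{body} : undef;
}

my $text = '//see http://example.com/a here//';
print "naive: ", italic_body($naive_italic, $text), "\n";   # naive: see http:
print "aware: ", italic_body($aware_italic, $text), "\n";   # aware: see http://example.com/a here
```

The italic rule in the listing that follows uses the same idea, with (?&link) in place of the toy link pattern.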

The monster regexp is built with this code:

my $named_link= q{ \s* (?<linkspec>[^|]*?) \s* (?:  (?<pipe>\|)  (?<name>.*?)  \s* )? };

sub build_parser_rules
 {
 my @parts;
 push @parts, [ 70, q{ \{{3} \s* (?<body>.*?) \s* \}{3} (*:nowiki)  } ];
 push @parts, [ 80, qr{ \{{2} $named_link \}{2} (*:image)   }xs ];
 push @parts, [ 90, qr{ \<{3} \s* (?<body>.*?) \s* \>{3} (*:placeholder)  }xs ];
 push @parts, [ 40, q{// \s* (?<body>(?: (?&link)  | . )*?)  \s*  (?: (?: (?<!~)//) | \Z)(*:italic) } ];   # special rules for //, skip any links in body.
 push @parts, [ 30, q{~ (?<body> (?&link)|.|\Z  ) (*:escape)} ];
 push @parts, [ 60, qr{\\\\ (*:break)}x ];
 return \@parts;
 }
    
method formulate_link_rule
 {
 # formulate the 'link' rule, which includes link_prefixes which are set after construction.
 # So this is used at the last moment before the parser is created.
 my $linkprefix= join "|", map { quotemeta($_) } @{$self->link_prefixes};
 my $blend= $self->get_parse_option('blended_links') ?
    q{ (?(<pipe>)|(?<blendsuffix> \w+)?  ) }
    : '';
 my $link= qr{
       (?<link>
          (?: \[{2} $named_link \]{2}  # explicit use of brackets  
             $blend  # blend suffix
             )  
          | (?:  (?<linkspec>(?: $linkprefix )://\S+?) 
             (?= [,.?!:;"']?   # ❝Single punctuation characters (,.?!:;"') at the end of URLs should not be considered part of the URL.❞
             (?: \Z|\s ) )  # since I used a lazy quantifier to allow the trailing punctuation, need to know how to end.
             )
       )(*:link)
    }x;
 return [ 10, $link ];
 }
 
method formulate_simples_rule
 {
 # formulate the 'simples' rule, which includes simple_format_tags which are set after construction.
 # simple_format_tags are qw[** //] for standard Creole, and can have extensions such as qw/__ ## ^^/ for underlined, monospaced, superscript, etc.  "simple" means same open and close and maps to a html tag.
 # So this is used at the last moment before the parser is created.
 my $simples= join "|", map { $_ eq '//' ? () : quotemeta($_) }  (keys %{$self->simple_format_tags});
 return [ 50, q{(?<simple> (?:} . $simples . q{))\s*(?<body>.*?) \s* (?: (?: (?<!~)\k<simple>) | \Z)  (*:simple)} ];
 }

method get_final_parser_rules
 {
 my $parser_rules= $self->parser_rules;
 push @$parser_rules, $self->formulate_link_rule, $self->formulate_simples_rule;
 return $parser_rules
 } 

method _build_parser_spec
 {
 my $parser_rules= $self->get_final_parser_rules;
 my $branches_string= join "\n | ", map {  
    my $x= $$_[1];
    ref $x ? $x : "(?: $x )"
    } (sort { $a->[0] <=> $b->[0] } @$parser_rules);
 my $ps= qr{(?<prematch>.*?)
     (?: $branches_string
     | \Z (*:nada)  # must be the last branch
    ) }xs;
 return $ps;
 }

The _build_parser_spec method is a lazy Moose builder. It sorts the rules by their numeric priority, splices them together into one alternation, and inserts that into the rest of the expression. build_parser_rules is also a builder; it just puts together the array. This array can be extended by add-ins.
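As a standalone illustration of that composition step, here is a minimal sketch without Moose. The rule table, the compose and first_branch helpers, and the toy patterns are all hypothetical; only the [ priority, pattern ] shape and the sort-then-join logic follow the listing above.

```perl
use strict;
use warnings;

our $REGMARK;    # package variable set by the regex engine after a match

# Hypothetical mini rule table in the same [ priority, pattern ] shape;
# strings and qr// objects can be mixed, as in the listing above.
my @rules = (
    [ 50, q{ \*\* (?<body>.*?) \*\* (*:bold) } ],
    [ 10, qr{ (?<linkspec> https?://\S+ ) (*:link) }x ],
);

# An add-in extends the table before the regex is built.
push @rules, [ 40, q{ // (?<body>.*?) // (*:italic) } ];

# Sort by priority, wrap plain strings in a group, and join into one alternation.
sub compose {
    my (@rules) = @_;
    my $branches = join "\n | ", map {
        my $pat = $_->[1];
        ref($pat) ? $pat : "(?: $pat )";
    } sort { $a->[0] <=> $b->[0] } @rules;
    return qr{ (?<prematch>.*?)
        (?: $branches
          | \z (*:nada)    # must be the last branch
        ) }xs;
}

# Name of the branch that produced the first match, or undef on no match.
sub first_branch {
    my ($regex, $text) = @_;
    return $text =~ $regex ? $REGMARK : undef;
}

my $parser = compose(@rules);
print first_branch($parser, 'x **b** y'), "\n";    # prints "bold"
```

Lower numbers end up earlier in the alternation, so they are tried first when two branches could match at the same position.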

The function that uses the resulting regexp is:

method do_format (Str $line)
 {
 my @results;
 my $ps= $self->_parser_spec;
 while ($line =~ /$ps/g) {
    my %captures= %+;
    my $regmark= $REGMARK;
    my $prematch= $captures{prematch};
    push @results, $self->escape($self->filter($prematch))  unless length($prematch)==0;
    unless ($regmark eq 'nada') {
       my $meth= "grammar_branch_$regmark";
       push @results, $self->$meth (\%captures);
       }
    }
 return join ('', @results);
 }

This uses $REGMARK to know which branch was taken and converts that to an Action in $meth. The %+ hash contains the captures, which need to be saved off before any use of the regexp engine occurs again.
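The dispatch idea can be tried in isolation. This is a minimal sketch with a toy two-branch grammar; the $toy pattern and the branches_taken helper are made up for illustration. Note that $REGMARK is an ordinary package variable, so under strict it has to be declared with our (or fully qualified).

```perl
use strict;
use warnings;

our $REGMARK;    # set by the regex engine in the package that runs the match

# Two toy branches plus the end-of-input branch, each tagged with a (*:name) marker.
my $toy = qr{ (?<prematch>.*?)
    (?: \*\* (?<body>.*?) \*\* (*:bold)
      | \\\\ (*:break)
      | \z (*:nada)
    ) }xs;

# Collect [ branch name, prematch, body ] for each match along the line.
sub branches_taken {
    my ($line) = @_;
    my @seen;
    while ($line =~ /$toy/g) {
        my %captures = %+;          # copy now: the next match overwrites %+
        my $mark     = $REGMARK;    # likewise replaced by the next match
        push @seen, [ $mark, $captures{prematch}, $captures{body} ];
        last if $mark eq 'nada';    # end of input reached
    }
    return @seen;
}

for my $hit (branches_taken('a **b** c')) {
    my ($mark, $pre, $body) = @$hit;
    printf "%-5s prematch=[%s] body=[%s]\n", $mark, $pre, $body // '';
}
```

In the real do_format the saved mark is turned into a method name (grammar_branch_bold and so on) instead of being collected in a list.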

Finally, the resulting text is gathered as individual generated pieces into an array, and joined at the very end. I think it is more efficient to join once, when the final length is known, than to keep appending small parts. There is no reason to keep it as a contiguous string as I go.
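Whether joining really beats appending is easy to measure rather than guess. This is a minimal sketch using the core Benchmark module; the fragment contents and the by_join/by_append names are made up for illustration, and the outcome may well depend on the Perl version and the size of the pieces.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up fragments standing in for the generated pieces.
my @pieces = map { "<em>piece $_</em>" } 1 .. 1000;

# Collect everything first, then join once at the end.
sub by_join {
    return join '', @pieces;
}

# Grow one string by appending each piece as it arrives.
sub by_append {
    my $out = '';
    $out .= $_ for @pieces;
    return $out;
}

# Run with --bench to compare; each candidate runs for about one CPU second.
cmpthese(-1, { join => \&by_join, append => \&by_append })
    if @ARGV && $ARGV[0] eq '--bench';
```

Both subs produce the same string, so the choice only affects speed and memory behaviour.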

In reply to Re: Some Results on Composing Complex Regexes by John M. Dlugosz
in thread Some Results on Composing Complex Regexes by John M. Dlugosz
