This grammar handles "inline formatting", as opposed to block-level constructs. It has to handle things like //italic// and **bold**, and understand that '//' might be part of an implicit link (a URL) rather than an italic open/close marker. Similarly, ~ is the escape character, and it can also appear in front of something that looks like a URL to prevent it from being auto-linked.
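To make the '//' ambiguity concrete, here is a much-reduced sketch, not the real grammar. The three branches, the hard-coded http/https/ftp prefixes, and the `classify` helper are assumptions made up for illustration. The italic body uses `(?&link)` to step over a whole URL at once, so the `//` inside the URL is never mistaken for the closing marker.

```perl
use strict;
use warnings;

our $REGMARK;   # filled in by the regex engine after a successful match

# Hypothetical reduced grammar: escape, implicit link, italic.
my $inline = qr{
      ~ (?<body> (?&link) | . )                (*:escape)
    | (?<link> (?: https? | ftp ) :// \S+ )    (*:link)
    | // (?<body> (?: (?&link) | . )*? ) //    (*:italic)
}xs;

# Return "mark:text" for every construct found in $text.
sub classify {
    my ($text) = @_;
    my @found;
    while ($text =~ /$inline/g) {
        my $what = $REGMARK eq 'link' ? $+{link} : $+{body};
        push @found, join ':', $REGMARK, $what;
    }
    return @found;
}

print "$_\n" for classify('//see http://example.com/a now// and ~http://x.org');
```

If the italic body were a plain `.*?`, the italic span in that sample would end at the `//` of `http://`.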
The monster regexp is built with this code:

    my $named_link= q{ \s* (?<linkspec>[^|]*?) \s* (?: (?<pipe>\|) (?<name>.*?) \s* )? };

    sub build_parser_rules
     {
     my @parts;
     push @parts, [ 70, q{ \{{3} \s* (?<body>.*?) \s* \}{3} (*:nowiki) } ];
     push @parts, [ 80, qr{ \{{2} $named_link \}{2} (*:image) }xs ];
     push @parts, [ 90, qr{ \<{3} \s* (?<body>.*?) \s* \>{3} (*:placeholder) }xs ];
     push @parts, [ 40, q{// \s* (?<body>(?: (?&link) | . )*?) \s* (?: (?: (?<!~)//) | \Z)(*:italic) } ];  # special rules for //, skip any links in body.
     push @parts, [ 30, q{~ (?<body> (?&link)|.|\Z ) (*:escape)} ];
     push @parts, [ 60, qr{\\\\ (*:break)}x ];
     return \@parts;
     }

    method formulate_link_rule
     {
     # formulate the 'link' rule, which includes link_prefixes which are set after construction.
     # So this is used at the last moment before the parser is created.
     my $linkprefix= join "|", map { quotemeta($_) } @{$self->link_prefixes};
     my $blend= $self->get_parse_option('blended_links') ? q{ (?(<pipe>)|(?<blendsuffix> \w+)? ) } : '';
     my $link= qr{
        (?<link>
           (?:
              \[{2} $named_link \]{2}  # explicit use of brackets
              $blend  # blend suffix
           )
           | (?:
              (?<linkspec>(?: $linkprefix )://\S+?)
              (?= [,.?!:;"']?  # ❝Single punctuation characters (,.?!:;"') at the end of URLs should not be considered part of the URL.❞
                 (?: \Z|\s ) )  # since I used a lazy quantifier to allow the trailing punctuation, need to know how to end.
           )
        )(*:link)
        }x;
     return [ 10, $link ];
     }

    method formulate_simples_rule
     {
     # formulate the 'simples' rule, which includes simple_format_tags which are set after construction.
     # simple_format_tags are qw[** //] for standard Creole, and can have extensions such as qw/__ ## ^^/ for underlined, monospaced, superscript, etc. "simple" means same open and close and maps to a html tag.
     # So this is used at the last moment before the parser is created.
     my $simples= join "|", map { $_ eq '//' ? () : quotemeta($_) } (keys %{$self->simple_format_tags});
     return [ 50, q{(?<simple> (?:} . $simples . q{))\s*(?<body>.*?) \s* (?: (?: (?<!~)\k<simple>) | \Z) (*:simple)} ];
     }

    method get_final_parser_rules
     {
     my $parser_rules= $self->parser_rules;
     push @$parser_rules, $self->formulate_link_rule, $self->formulate_simples_rule;
     return $parser_rules
     }

    method _build_parser_spec
     {
     my $parser_rules= $self->get_final_parser_rules;
     my $branches_string= join "\n | ", map { my $x= $$_[1]; ref $x ? $x : "(?: $x )" } (sort { $a->[0] <=> $b->[0] } @$parser_rules);
     my $ps= qr{(?<prematch>.*?)
        (?: $branches_string
        | \Z (*:nada)  # must be the last branch
        )
        }xs;
     return $ps;
     }

The _build_parser_spec is a lazy Moose Builder. It splices together all the rules and inserts them into the rest of the expression. build_parser_rules is also a Builder; it just puts together the array of [priority, pattern] pairs. This array can be extended by add-ins.
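The splicing step can be tried in isolation. The following is a minimal sketch in plain Perl (no Moose), assuming a made-up three-rule table; `build_spec` and `first_match` are hypothetical helpers, not part of the module. It shows the two points that matter: rules are sorted by their numeric priority before being joined, and plain-string rules are wrapped in `(?: ... )` while `qr` objects are interpolated as they are.

```perl
use strict;
use warnings;

our $REGMARK;   # set by the regex engine to the name of the last (*:NAME) passed

# Hypothetical rule table, deliberately listed out of priority order.
my @rules = (
    [ 50, qr{ \*\* (?<body> .*? ) \*\* (*:bold)   }xs ],
    [ 30, q{ ~ (?<body> . ) (*:escape) }             ],  # plain string rule
    [ 40, qr{ //   (?<body> .*? ) //   (*:italic) }xs ],
);

# Sort by priority, wrap string rules, and splice into one alternation.
sub build_spec {
    my ($rules) = @_;
    my $branches = join "\n | ",
        map  { my $x = $_->[1]; ref $x ? $x : "(?: $x )" }
        sort { $a->[0] <=> $b->[0] } @$rules;
    return qr{ (?<prematch> .*? )
        (?: $branches
          | \z (*:nada)    # catch-all, must be the last branch
        )
    }xs;
}

# Return (mark, prematch, body) for the first construct in $text.
sub first_match {
    my ($spec, $text) = @_;
    return unless $text =~ $spec;
    return ($REGMARK, $+{prematch}, $+{body});
}

my $spec = build_spec(\@rules);
my ($mark, $pre, $body) = first_match($spec, 'see //this//');
print "$mark [$pre] [$body]\n";
```

Because the escape rule has the lowest number, it is tried first at every position, so in `~**` the tilde consumes the first `*` before the bold rule gets a chance.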
The function that uses the resulting regexp is:

    method do_format (Str $line)
     {
     my @results;
     my $ps= $self->_parser_spec;
     while ($line =~ /$ps/g) {
        my %captures= %+;
        my $regmark= $REGMARK;
        my $prematch= $captures{prematch};
        push @results, $self->escape($self->filter($prematch))  unless length($prematch)==0;
        unless ($regmark eq 'nada') {
           my $meth= "grammar_branch_$regmark";
           push @results, $self->$meth (\%captures);
           }
        }
     return join ('', @results);
     }

This uses $REGMARK to know which branch was taken and turns that into the name of an action method in $meth. The %+ hash contains the captures, which need to be saved off before the regexp engine is used again.
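For anyone who wants to experiment with the loop without the rest of the class, here is a self-contained sketch of the same dispatch idea. It assumes a tiny two-branch grammar and uses plain subs looked up with `can` in place of Moose methods; the handler bodies and the HTML they emit are invented for the example, and the escape/filter step on the prematch is left out.

```perl
use strict;
use warnings;

our $REGMARK;   # set by the regex engine after each successful match

# Hypothetical two-branch grammar plus the catch-all.
my $spec = qr{ (?<prematch> .*? )
    (?:   \*\* (?<body> .*? ) \*\*  (*:bold)
      |   //   (?<body> .*? ) //    (*:italic)
      |   \z (*:nada)               # must be the last branch
    )
}xs;

# One handler per mark name.
sub grammar_branch_bold   { my ($c) = @_; return "<strong>$c->{body}</strong>" }
sub grammar_branch_italic { my ($c) = @_; return "<em>$c->{body}</em>" }

sub do_format {
    my ($line) = @_;
    my @results;
    while ($line =~ /$spec/g) {
        my %captures = %+;       # copy now; the next regex use would clobber %+
        my $mark     = $REGMARK;
        push @results, $captures{prematch} if length $captures{prematch};
        next if $mark eq 'nada';
        my $handler = __PACKAGE__->can("grammar_branch_$mark")
            or die "no handler for branch '$mark'";
        push @results, $handler->(\%captures);
    }
    return join '', @results;
}

print do_format('a **b** and //c//!'), "\n";
```

The final `\z (*:nada)` branch is what picks up trailing plain text: the last pass matches the rest of the line as the prematch and has no action to dispatch.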
Finally, the resulting text is gathered as individual generated pieces in an array and joined at the very end. I think it is more efficient to join once, when the final length is known, than to keep appending small parts. There is no reason to keep it as a contiguous string as I go.
In reply to Re: Some Results on Composing Complex Regexes
by John M. Dlugosz
in thread Some Results on Composing Complex Regexes
by John M. Dlugosz