wiki regex reprocessing replacement

LanX has asked for the wisdom of the Perl Monks concerning the following question:

Task

I need a regex to transform wiki markup surrounding words to html, * to <b> etc.

my problem is that */_ could be combined at word boundaries, see the following example

 DB<66> $_=$wiki; tf();tf();tf() ; print "'$wiki' \n=>\n'$_'" 
'_*one /two/*_ _*three /four/*_ _*five /six/*_'
=>
'<u><b>one <i>two</i></b></u> <u><b>three /four/</b></u> <u><b>five <i
+>six</i></b></u>'
 DB<67>
[download]

'_*one /two/*_ _*three /four/*_ _*five /six/*_'
=>
'one two three /four/ five six'

as you can see I have to run the tf() transformation thrice

 
  DB<40> %h = ( '*'=>'b', '/' => 'i' , '_' => 'u' ) 

  DB<59> sub tf { s{ $pre ([_*/]) (.*?) \2 $post}{$1<$h{$2}>$3</$h{$2}
+>$4}xg }

  DB<62> $pre = qr/(^|\s|>)/

  DB<63> $post = qr/($|\s|<)/    

  DB<65> $wiki='_*one /two/*_ _*three /four/*_ _*five /six/*_'
[download]

Question

Is there a way to make it a one-run transformation?

Trouble is that /g continues after the inserted replacement, here underline

I was experimenting with lookaround-assertions and \G and couldn't get it done.

Approaches

The only ways I can (theoretically) think of so far are

to loop over /g in scalar context while (s///g) { ... } and to manipulate pos
or to manipulate pos in an embedded Perl code (?{...})
to call tf() recursively in the /e evaled replacement part

NB: It's a more theoretical question because running tf() three times doesn't pose problems.

UPDATE:

I just noticed a bug, since four wasn't expanded.

&tf has to be better written with a lookbehind which doesn't consume the next whitespace

  DB<90> sub tf { s{ $pre ([_*/]) (.*?) \2 (?=$post)}{$1<$h{$2}>$3</$h
+{$2}>}xg }
[download]

I'll update an SSCCE soon.

Cheers Rolf
_{(addicted to the Perl Programming Language :)

Wikisyntax for the Monastery}

Comment on wiki regex reprocessing replacement Select or Download Code

Replies are listed 'Best First'.
Re: wiki regex reprocessing replacement by tybalt89 (Monsignor) on Feb 15, 2020 at 18:08 UTC
#!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11112988 use warnings; my $wiki = '_/one two/ th/ree_ null _/four five/ six_ null _/se_ven eig +ht/ nine_'; my $expected = '<u><i>one <b>two</b></i> th/ree</u> null <u><i>four <b>five</b></ +i> six</u> null <u><i>se_ven <b>eight</b></i> nine</u>'; my %h = ( '' => 'b' , '/' => 'i' , '_' => 'u' , ); my $html = $wiki =~ s{ (?:^\|\s) \K ([_/]+) \| ([_/]) (?=$\|\s) } { $1 ? $1 =~ s\|.\|<$h{$&}>\|gr : $2 =~ s\|.\|</$h{$&}>\|gr }gexr; print $html eq $expected ? "passed" : "FAILED", "\n\n"; print $wiki, "\n\n", $expected, "\n\n", $html, "\n"; [download] Outputs; `passed _/one two/ th/ree_ null _/four five/ six_ null _/se_ven eight/ +nine_ <u><i>one <b>two</b></i> th/ree</u> null <u><i>four <b>five</b></i> si +x</u> null <u><i>se_ven <b>eight</b></i> nine</u> <u><i>one <b>two</b></i> th/ree</u> null <u><i>four <b>five</b></i> si +x</u> null <u><i>se_ven <b>eight</b></i> nine</u>` [download]	[reply] [d/l] [select]
Re^2: wiki regex reprocessing replacement by LanX (Saint) on Feb 15, 2020 at 18:23 UTC
ah yes `\K` not `\G` I keep confusing them. And I thought that $1 and $2 are read-only ... ah I see you use the /r flag. anyway, markup should be paired.° `my $wiki = '_one'; my $expected = $wiki; $html = $wiki =~ s{ (?:^\|\s) \K ([_/]+) \| ([_/]) (?=$\|\s) } { $1 ? $1 =~ s\|.\|<$h{$&}>\|gr : $2 =~ s\|.\|</$h{$&}>\|gr }gexr; print $html eq $expected ? "passed" : "FAILED", "\n\n"; print $wiki, "\n\n", $expected, "\n\n", $html, "\n";` [download] `FAILED _one* _one* <u>one</b>` [download] I've updated the tests in Re: wiki regex reprocessing replacement (UPDATED^2) with markup to ignore Funny enough, the monastery fails too :) FAILED _one* _one* one Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery} °) yes I know, wasn't explicitely tested	[reply] [d/l] [select]
Re^3: wiki regex reprocessing replacement by tybalt89 (Monsignor) on Feb 15, 2020 at 18:52 UTC
Let it fail, it should be obvious in the preview. hehehe	[reply]
Re^4: wiki regex reprocessing replacement by LanX (Saint) on Feb 15, 2020 at 19:50 UTC
Re: wiki regex reprocessing replacement by AnomalousMonk (Archbishop) on Feb 16, 2020 at 07:36 UTC
Here's my take. One thing I don't understand is the inclusion of `> <` characters in the pre- and post-markup tag delimiters (update: e.g., `my $pre = qr/(^\|\s\|>)/;` here), probably because I'm not familiar with wikisyntax. Can you link me to a discussion of the role of these characters? I prepared two versions, one using `(?(DEFINE) ...)` and one based purely on `qr//` interpolation. Maybe one is faster, but I haven't done any Benchmark-ing (nor am I likely to). Read more... (10 kB) Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^2: wiki regex reprocessing replacement by LanX (Saint) on Feb 16, 2020 at 11:52 UTC
Wow, thanks :) And the test suite ++ > One thing I don't understand is the inclusion of > < characters in the pre- and post-markup tag delimiters Because the repetitive solution with tf() needs to ignore previous runs. `/_word_/ -> <b>/_word_/</b> -> <b><i>_word_</i></b> -> etc.` The recursive solution with rec() doesn't really need it, one of the reasons why I prefer this approach. > probably because I'm not familiar with wikisyntax. No you are not wrong, there was information missing. In this particular case the syntax is also meant to coexist with more verbose html tags. There are cases where one doesn't want to have a whitespace in between neighboring tags. Just compare Re^3: Good Intentions: Wikisyntax for the Monastery and the complaint about `'ARGV'<br>` not expanding. Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply] [d/l] [select]
Re^3: wiki regex reprocessing replacement by AnomalousMonk (Archbishop) on Feb 16, 2020 at 23:47 UTC
You're welcome. Here are some test cases I've added since posting. I'd be very interested to hear your comments, especially as regards the "questionable" ones. '--- tests added 16feb20 after pm#11113014 post ---', '"failing" (i.e., no transformation) tests', [ '' => '', ], [ '' => '', ], [ '_/' => '_/', ], [ ' * _ / ' => ' * _ / ', ], [ 'fail/' => 'fail/', ], [ ' * fail / ' => ' * fail / ', ], 'possibly questionable transformations', [ '__' => '<u></u>', ], [ ' __ ' => ' <u></u> ', ], [ '__ __' => '<u></u> <u></u>', ], [ ' __ __ ' => ' <u></u> <u></u> ', ], [ '____' => '<u></u><u></u>', '???' ], [ ' ____ ' => ' <u></u><u></u> ', '???' ], [ '______' => '<u></u><u></u><u></u>', '???' ], [ ' ______ ' => ' <u></u><u></u><u></u> ', '???' ], [ '________' => '<u></u><u></u><u></u><u></u>', '???' ], [ ' ________ ' => ' <u></u><u></u><u></u><u></u> ', '???' ], [ '__ __ __ __' => '<u></u> <u></u> <u></u> <u></u>', ], [ ' __ __ __ __ ' => ' <u></u> <u></u> <u></u> <u></u> ', ], [download] In this particular case the syntax is also meant to coexist with more verbose html tags. There are cases where one doesn't want to have a whitespace in between neighboring tags. Can you supply some test cases for variations, especially WRT intermixtures with standard HTML? Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^4: wiki regex reprocessing replacement by LanX (Saint) on Feb 17, 2020 at 00:55 UTC
Re: wiki regex reprocessing replacement by LanX (Saint) on Feb 15, 2020 at 15:19 UTC
> I'll update an SSCCE soon. here we go (update: please see Re: wiki regex reprocessing replacement (UPDATED^2) for better testcases including wrong markup) use strict; use warnings; use Data::Dump qw/pp dd/; use Test::More; my $wiki = '_/one two/ three_ null _/four five/ six_ null _/seven eight +/ nine_'; my $expected = '<u><i>one <b>two</b></i> three</u> null <u><i>four <b>five</b></i +> six</u> null <u><i>seven <b>eight</b></i> nine</u>'; my $pre = qr/(^\|\s\|>)/; my $post = qr/($\|\s\|<)/; my %h = ( '' => 'b' , '/' => 'i' , '_' => 'u' , ); sub tf { s{ $pre ([_/]) (.?) \2 (?=$post)}{$1<$h{$2}>$3</$h{$2}>}xg +}; $_=$wiki; my $DBG = 1; diag "IN <= '$wiki'\n\n" if $DBG; for my $i (1..3) { tf(); diag "$i: '$_'\n\n" if $DBG; } is($_,$expected," repeated replace works"); done_testing; [download] `# IN <= '_/one two/ three_ null _/four five/ six_ null _/seven e +ight/ nine_' # # 1: '<u>/one two/ three</u> null <u>/four five/ six</u> null <u> +/seven eight/ nine</u>' # # 2: '<u><i>one two</i> three</u> null <u><i>four five</i> six</u> + null <u><i>seven eight</i> nine</u>' # # 3: '<u><i>one <b>two</b></i> three</u> null <u><i>four <b>five</b></ +i> six</u> null <u><i>seven <b>eight</b></i> nine</u>' # ok 1 - repeated replace works 1..1` [download] # IN <= '_/one two/ three_ null _/four five/ six_ null _/seven eight/ nine_' # # 1: '/one two/ three null /four five/ six null /seven eight/ nine' # # 2: 'one two* three null four five six null seven eight nine' # # 3: 'one two three null four five six null seven eight nine' # ok 1 - repeated replace works 1..1 Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply] [d/l] [select]
Re^2: wiki regex reprocessing replacement by hippo (Bishop) on Feb 15, 2020 at 16:59 UTC
Here's a non-recursive way which I think fits your criteria: use strict; use warnings; use Test::More; my $wiki = '_/one two/ three_ null _/four five/ six_ null _/seven eight +/ nine_'; my $expected = '<u><i>one <b>two</b></i> three</u> null <u><i>four <b>five</b></i +> six</u> null <u><i>seven <b>eight</b></i> nine</u>'; my %h = ( '' => 'b' , '/' => 'i' , '_' => 'u' , ); my $DBG = 1; sub flip { my $s = shift; my $z = $h{$s}; $h{$s} = $z =~ /\// ? substr ($z, 1, 1) : "/$z"; return "<$z>"; } sub tf { diag "Pre: '$_'\n\n" if $DBG; s{([_/])}{flip($1)}eg }; $_ = $wiki; diag "IN <= '$wiki'\n\n" if $DBG; tf(); is ($_, $expected, " repeated replace works"); done_testing; [download]	[reply] [d/l]
Re^3: wiki regex reprocessing replacement by LanX (Saint) on Feb 15, 2020 at 17:12 UTC
Many thanks, :) ... but ... The testsuite should have also included markup which must not be replaced My fault sorry, I thought it's obvious by the $pre and $post regex. The markup must come in pairs and be embraced by special word boundaries. (whitespace or other markup or tag-brackets or ... depending on pre/post) Hence a _ inside a word is forbidden, which makes sense for `joined_identifiers` . I've updated the tests in Re: wiki regex reprocessing replacement (UPDATED^2) with markup to ignore Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply] [d/l]
Re^4: wiki regex reprocessing replacement by hippo (Bishop) on Feb 15, 2020 at 18:16 UTC
Re: wiki regex reprocessing replacement (UPDATED^2) by LanX (Saint) on Feb 15, 2020 at 16:55 UTC
UPDATES: Expanded test case with wrong markup to be ignored markup inside word wrong pairs like `fail_` > to call tf() recursively in the /e evaled replacement part* seems to work well, see sub rec use strict; use warnings; use Data::Dump qw/pp dd/; use Test::More; my $wiki = '_/one two/ th/ree_ null _/fur five/ six_ null _/se_ven eig +ht/ nine_ fail_'; my $expected = '<u><i>one <b>two</b></i> th/ree</u> null <u><i>fur <b>five</b></ +i> six</u> null <u><i>se_ven <b>eight</b></i> nine</u> fail_'; my $pre = qr/(^\|\s\|>)/; my $post = qr/($\|\s\|<)/; my %h = ( '' => 'b' , '/' => 'i' , '_' => 'u' , ); sub tf { s{ $pre ([_/]) (.?) \2 (?=$post)}{$1<$h{$2}>$3</$h{$2}>}xg +}; $_=$wiki; my $DBG = 0; diag "IN <= '$wiki'\n\n" if $DBG; for my $i (1..3) { tf(); diag "$i: '$_'\n\n" if $DBG; } is($_,$expected,"repeated replace"); my $rec_level=0; sub rec { my ($txt) = @_; my $DBG = 1; diag ++$rec_level ."< $txt" if $DBG; $txt =~ s{ $pre ([_/]) (.?) \2 (?=$post) }{ my $tag = $h{$2}; "$1<$tag>" . rec($3). "</$tag>" }xge; diag $rec_level-- . "> $txt\n" if $DBG; return $txt; } my $got_rec = rec($wiki); is($got_rec,$expected,"recursive replace"); done_testing; [download] ok 1 - repeated replace # 1< _/one two/ th/ree_ null _/fur five/ six_ null _/se_ven eight/ nine_ fail_ # 2< /one two/ th/ree # 3< one two* # 4< two # 4> two # 3> one two # 2> one two th/ree # 2< /fur five/ six # 3< fur five # 4< five # 4> five # 3> fur five* # 2> fur five* six # 2< /se_ven eight/ nine # 3< se_ven eight # 4< eight # 4> eight # 3> se_ven eight # 2> se_ven eight nine # 1> one two th/ree null fur five* six null se_ven eight nine *fail_ ok 2 - recursive replace 1..2 Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply] [d/l] [select]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks