LanX has asked for the wisdom of the Perl Monks concerning the following question:
Hi
Task
I need a regex to transform wiki markup surrounding words to html, * to <b> etc.
my problem is that */_ could be combined at word boundaries, see the following example
DB<66> $_=$wiki; tf();tf();tf() ; print "'$wiki' \n=>\n'$_'"
'_*one /two/*_ _*three /four/*_ _*five /six/*_'
=>
'<u><b>one <i>two</i></b></u> <u><b>three /four/</b></u> <u><b>five <i
+>six</i></b></u>'
DB<67>
'_*one /two/*_ _*three /four/*_ _*five /six/*_'
=>
'one two three /four/ five six'
as you can see I have to run the tf() transformation thrice
DB<40> %h = ( '*'=>'b', '/' => 'i' , '_' => 'u' )
DB<59> sub tf { s{ $pre ([_*/]) (.*?) \2 $post}{$1<$h{$2}>$3</$h{$2}
+>$4}xg }
DB<62> $pre = qr/(^|\s|>)/
DB<63> $post = qr/($|\s|<)/
DB<65> $wiki='_*one /two/*_ _*three /four/*_ _*five /six/*_'
Question
Is there a way to make it a one-run transformation?
Trouble is that /g continues after the inserted replacement, here underline
I was experimenting with lookaround-assertions and \G and couldn't get it done.
Approaches
The only ways I can (theoretically) think of so far are
- to loop over /g in scalar context while (s///g) { ... } and to manipulate pos
- or to manipulate pos in an embedded Perl code (?{...})
- to call tf() recursively in the /e evaled replacement part
NB: It's a more theoretical question because running tf() three times doesn't pose problems.
UPDATE:
I just noticed a bug, since four wasn't expanded.
&tf has to be better written with a lookbehind which doesn't consume the next whitespace
DB<90> sub tf { s{ $pre ([_*/]) (.*?) \2 (?=$post)}{$1<$h{$2}>$3</$h
+{$2}>}xg }
I'll update an SSCCE soon.
Re: wiki regex reprocessing replacement
by tybalt89 (Monsignor) on Feb 15, 2020 at 18:08 UTC
|
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11112988
use warnings;
my $wiki =
'_/one *two*/ th/ree_ null _/four *five*/ six_ null _/se_ven *eig
+ht*/ nine_';
my $expected =
'<u><i>one <b>two</b></i> th/ree</u> null <u><i>four <b>five</b></
+i> six</u> null <u><i>se_ven <b>eight</b></i> nine</u>';
my %h = (
'*' => 'b' ,
'/' => 'i' ,
'_' => 'u' ,
);
my $html = $wiki =~ s{ (?:^|\s) \K ([*_/]+) | ([*_/]*) (?=$|\s) }
{ $1 ? $1 =~ s|.|<$h{$&}>|gr : $2 =~ s|.|</$h{$&}>|gr }gexr;
print $html eq $expected ? "passed" : "FAILED", "\n\n";
print $wiki, "\n\n", $expected, "\n\n", $html, "\n";
Outputs;
passed
_/one *two*/ th/ree_ null _/four *five*/ six_ null _/se_ven *eight*/
+nine_
<u><i>one <b>two</b></i> th/ree</u> null <u><i>four <b>five</b></i> si
+x</u> null <u><i>se_ven <b>eight</b></i> nine</u>
<u><i>one <b>two</b></i> th/ree</u> null <u><i>four <b>five</b></i> si
+x</u> null <u><i>se_ven <b>eight</b></i> nine</u>
| [reply] [d/l] [select] |
|
ah yes \K not \G I keep confusing them.
And I thought that $1 and $2 are read-only ... ah I see you use the /r flag.
anyway, markup should be paired.°
my $wiki =
'_one*';
my $expected = $wiki;
$html = $wiki =~ s{ (?:^|\s) \K ([*_/]+) | ([*_/]*) (?=$|\s) }
{ $1 ? $1 =~ s|.|<$h{$&}>|gr : $2 =~ s|.|</$h{$&}>|gr }gexr;
print $html eq $expected ? "passed" : "FAILED", "\n\n";
print $wiki, "\n\n", $expected, "\n\n", $html, "\n";
FAILED
_one*
_one*
<u>one</b>
I've updated the tests in Re: wiki regex reprocessing replacement (UPDATED^2) with markup to ignore
Funny enough, the monastery fails too :)
FAILED
_one*
_one*
one
°) yes I know, wasn't explicitely tested | [reply] [d/l] [select] |
|
| [reply] |
|
Re: wiki regex reprocessing replacement
by AnomalousMonk (Archbishop) on Feb 16, 2020 at 07:36 UTC
|
Here's my take. One thing I don't understand is the inclusion of > < characters in the pre- and post-markup tag delimiters (update: e.g., my $pre = qr/(^|\s|>)/; here), probably because I'm not familiar with wikisyntax. Can you link me to a discussion of the role of these characters? I prepared two versions, one using (?(DEFINE) ...) and one based purely on qr// interpolation. Maybe one is faster, but I haven't done any Benchmark-ing (nor am I likely to).
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
Wow, thanks :)
And the test suite ++
> One thing I don't understand is the inclusion of > < characters in the pre- and post-markup tag delimiters
Because the repetitive solution with tf() needs to ignore previous runs.
*/_word_/* -> <b>/_word_/</b> -> <b><i>_word_</i></b> -> etc.
The recursive solution with rec() doesn't really need it, one of the reasons why I prefer this approach.
> probably because I'm not familiar with wikisyntax.
No you are not wrong, there was information missing.
In this particular case the syntax is also meant to coexist with more verbose html tags.
There are cases where one doesn't want to have a whitespace in between neighboring tags.
Just compare Re^3: Good Intentions: Wikisyntax for the Monastery and the complaint about 'ARGV'<br> not expanding.
| [reply] [d/l] [select] |
|
'--- tests added 16feb20 after pm#11113014 post ---',
'"failing" (i.e., no transformation) tests',
[ '' => '', ],
[ '*' => '*', ],
[ '*_/' => '*_/', ],
[ ' * _ / ' => ' * _ / ', ],
[ '*fail/' => '*fail/', ],
[ ' * fail / ' => ' * fail / ', ],
'possibly questionable transformations',
[ '__' => '<u></u>', ],
[ ' __ ' => ' <u></u> ', ],
[ '__ __' => '<u></u> <u></u>', ],
[ ' __ __ ' => ' <u></u> <u></u> ', ],
[ '____' => '<u></u><u></u>', '???' ],
[ ' ____ ' => ' <u></u><u></u> ', '???' ],
[ '______' => '<u></u><u></u><u></u>', '???' ],
[ ' ______ ' => ' <u></u><u></u><u></u> ', '???' ],
[ '________' => '<u></u><u></u><u></u><u></u>', '???' ],
[ ' ________ ' => ' <u></u><u></u><u></u><u></u> ', '???' ],
[ '__ __ __ __' => '<u></u> <u></u> <u></u> <u></u>', ],
[ ' __ __ __ __ ' => ' <u></u> <u></u> <u></u> <u></u> ', ],
In this particular case the syntax is also meant to coexist with more verbose html tags.
There are cases where one doesn't want to have a whitespace in between neighboring tags.
Can you supply some test cases for variations, especially WRT intermixtures with standard HTML?
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
Re: wiki regex reprocessing replacement
by LanX (Saint) on Feb 15, 2020 at 15:19 UTC
|
use strict;
use warnings;
use Data::Dump qw/pp dd/;
use Test::More;
my $wiki =
'_/one *two*/ three_ null _/four *five*/ six_ null _/seven *eight
+*/ nine_';
my $expected =
'<u><i>one <b>two</b></i> three</u> null <u><i>four <b>five</b></i
+> six</u> null <u><i>seven <b>eight</b></i> nine</u>';
my $pre = qr/(^|\s|>)/;
my $post = qr/($|\s|<)/;
my %h = (
'*' => 'b' ,
'/' => 'i' ,
'_' => 'u' ,
);
sub tf { s{ $pre ([_*/]) (.*?) \2 (?=$post)}{$1<$h{$2}>$3</$h{$2}>}xg
+};
$_=$wiki;
my $DBG = 1;
diag "IN <= '$wiki'\n\n" if $DBG;
for my $i (1..3) {
tf();
diag "$i: '$_'\n\n" if $DBG;
}
is($_,$expected," repeated replace works");
done_testing;
# IN <= '_/one *two*/ three_ null _/four *five*/ six_ null _/seven *e
+ight*/ nine_'
#
# 1: '<u>/one *two*/ three</u> null <u>/four *five*/ six</u> null <u>
+/seven *eight*/ nine</u>'
#
# 2: '<u><i>one *two*</i> three</u> null <u><i>four *five*</i> six</u>
+ null <u><i>seven *eight*</i> nine</u>'
#
# 3: '<u><i>one <b>two</b></i> three</u> null <u><i>four <b>five</b></
+i> six</u> null <u><i>seven <b>eight</b></i> nine</u>'
#
ok 1 - repeated replace works
1..1
# IN <= '_/one *two*/ three_ null _/four *five*/ six_ null _/seven *eight*/ nine_'
#
# 1: '/one *two*/ three null /four *five*/ six null /seven *eight*/ nine'
#
# 2: 'one *two* three null four *five* six null seven *eight* nine'
#
# 3: 'one two three null four five six null seven eight nine'
#
ok 1 - repeated replace works
1..1
| [reply] [d/l] [select] |
|
use strict;
use warnings;
use Test::More;
my $wiki =
'_/one *two*/ three_ null _/four *five*/ six_ null _/seven *eight
+*/ nine_';
my $expected =
'<u><i>one <b>two</b></i> three</u> null <u><i>four <b>five</b></i
+> six</u> null <u><i>seven <b>eight</b></i> nine</u>';
my %h = (
'*' => 'b' ,
'/' => 'i' ,
'_' => 'u' ,
);
my $DBG = 1;
sub flip {
my $s = shift;
my $z = $h{$s};
$h{$s} = $z =~ /\// ? substr ($z, 1, 1) : "/$z";
return "<$z>";
}
sub tf {
diag "Pre: '$_'\n\n" if $DBG;
s{([_*/])}{flip($1)}eg
};
$_ = $wiki;
diag "IN <= '$wiki'\n\n" if $DBG;
tf();
is ($_, $expected, " repeated replace works");
done_testing;
| [reply] [d/l] |
|
Many thanks, :)
... but ...
The testsuite should have also included markup which must not be replaced
My fault sorry, I thought it's obvious by the $pre and $post regex.
The markup must come in pairs and be embraced by special word boundaries.
(whitespace or other markup or tag-brackets or ... depending on pre/post)
Hence a _ inside a word is forbidden, which makes sense for joined_identifiers .
I've updated the tests in Re: wiki regex reprocessing replacement (UPDATED^2) with markup to ignore
| [reply] [d/l] |
|
Re: wiki regex reprocessing replacement (UPDATED^2)
by LanX (Saint) on Feb 15, 2020 at 16:55 UTC
|
UPDATES:
Expanded test case with wrong markup to be ignored
- markup inside word
- wrong pairs like *fail_
> to call tf() recursively in the /e evaled replacement part
seems to work well, see sub rec
use strict;
use warnings;
use Data::Dump qw/pp dd/;
use Test::More;
my $wiki =
'_/one *two*/ th/ree_ null _/f*ur *five*/ six_ null _/se_ven *eig
+ht*/ nine_ *fail_';
my $expected =
'<u><i>one <b>two</b></i> th/ree</u> null <u><i>f*ur <b>five</b></
+i> six</u> null <u><i>se_ven <b>eight</b></i> nine</u> *fail_';
my $pre = qr/(^|\s|>)/;
my $post = qr/($|\s|<)/;
my %h = (
'*' => 'b' ,
'/' => 'i' ,
'_' => 'u' ,
);
sub tf { s{ $pre ([_*/]) (.*?) \2 (?=$post)}{$1<$h{$2}>$3</$h{$2}>}xg
+};
$_=$wiki;
my $DBG = 0;
diag "IN <= '$wiki'\n\n" if $DBG;
for my $i (1..3) {
tf();
diag "$i: '$_'\n\n" if $DBG;
}
is($_,$expected,"repeated replace");
my $rec_level=0;
sub rec {
my ($txt) = @_;
my $DBG = 1;
diag ++$rec_level ."< $txt" if $DBG;
$txt =~
s{
$pre ([_*/]) (.*?) \2 (?=$post)
}{
my $tag = $h{$2};
"$1<$tag>" . rec($3). "</$tag>"
}xge;
diag $rec_level-- . "> $txt\n" if $DBG;
return $txt;
}
my $got_rec = rec($wiki);
is($got_rec,$expected,"recursive replace");
done_testing;
ok 1 - repeated replace
# 1< _/one *two*/ th/ree_ null _/f*ur *five*/ six_ null _/se_ven *eight*/ nine_ *fail_
# 2< /one *two*/ th/ree
# 3< one *two*
# 4< two
# 4> two
# 3> one two
# 2> one two th/ree
# 2< /f*ur *five*/ six
# 3< f*ur *five*
# 4< five
# 4> five
# 3> f*ur five
# 2> f*ur five six
# 2< /se_ven *eight*/ nine
# 3< se_ven *eight*
# 4< eight
# 4> eight
# 3> se_ven eight
# 2> se_ven eight nine
# 1> one two th/ree null f*ur five six null se_ven eight nine *fail_
ok 2 - recursive replace
1..2
| [reply] [d/l] [select] |
|
|