paschacroutt has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

What follows is my very first adventure in the Perl world, so please be merciful.

I wrote the following code to modify a bunch of pandoc-generated DokuWiki formatted files in order to convert some expressions to DokuWiki internal links. That went through many iterations and I am now somewhat satisfied with the result except for one mystery I am unable to pierce through.
It does not crash my code, nor does it trigger any warnings, but I can't explain this result.

Here's my problem:

In this part of the substitution (line 44): .(defined($4) ? $4 =~ tr[\/*][]dr : '.') I check whether $4 is defined and, if it is, I apply a tr to it to remove italic (//) or bold (**) marks if they are present. If $4 is undefined (meaning there is no dot, comma or semicolon after the last word of the expression), the conditional operator supplies a dot to end the substitution.
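To make the conditional concrete, here is a minimal sketch of that expression in isolation. The helper name closing_mark is hypothetical and only wraps the line-44 expression; the d flag of tr deletes every / and *, and the r flag returns the result instead of modifying the operand.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the expression on line 44 of the script:
# strip '/' and '*' from a captured terminator, or fall back to '.'.
sub closing_mark {
    my ($captured) = @_;
    return defined($captured) ? $captured =~ tr[/*][]dr : '.';
}

print closing_mark(','),   "\n";   # prints ","
print closing_mark(undef), "\n";   # prints "."
```

Note that in the posted pattern, group 4 is ([\.,;]), which can only capture a single dot, comma or semicolon, so the tr applied to $4 looks like a safety net rather than something that ever changes the value.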

So either this data entry:

**Voir :** proton, solution hydrogénée//.//
or that one:
**Voir :** proton, solution hydrogénée.
Give this result:
**Voir :** [[glossaire:entrees:p:proton|proton]], [[glossaire:entrees:s:solution_hydrogenee|solution hydrogénée]].
Which is exactly what is needed.
But if there is no final dot or comma after solution hydrogénée:
**Voir :** proton, solution hydrogénée
What I get is:
**Voir :** [[glossaire:entrees:p:proton|proton]], [[glossaire:entrees:s:solution_hydrogenee
|solution hydrogénée
]].
With line breaks after solution_hydrogenee and hydrogénée.
Which is ok but I was expecting:
**Voir :** [[glossaire:entrees:p:proton|proton]], [[glossaire:entrees:s:solution_hydrogenee|solution hydrogénée]].
No line breaks.
And I don't understand why and how those line breaks get inserted.
Actually, those breaks don't really matter: the DokuWiki format being a kind of lesser Markdown, the HTML output is the same with or without them. But it worries me as a sign that I don't fully understand my own code.

I suspect it may have something to do with the /x modifier.

The actual code follows


 1  #!/usr/bin/env perl
 2
 3  use 5.36.1;
 4  use warnings;
 5  use strict;
 6  use utf8;
 7  use autodie;
 8
 9  use warnings qw< FATAL utf8 >;
10  use open qw< :std :utf8 >;
11  use charnames qw< :full >;
12  use feature qw< unicode_strings >;
13
14  binmode(STDIN, ":utf8");
15  binmode(STDOUT, ":utf8");
16  binmode(STDERR, ":utf8");
17
18  use Text::Undiacritic qw(undiacritic);
19
20  $^I = ".bak";
21
22  while (<>){
23
24      my $voir = $_;
25
26      $voir =~ s/
27          (?:^\*\*Voir\s:\*\*
28          |
29          \G(?!^)
30          (?!\[))
31          \K
32          (\s?)
33          ((\w[\/*]*)
34          (?:[^\.,;\n\r]\s?)+)
35          [\/*]*([\.,;])?[\/*]*
36      /
37          "$1\[\[glossaire:entrees:"
38          .lc(undiacritic($3))
39          .":"
40          .lc(undiacritic($2 =~ tr[ \/*][_]dr))
41          ."|"
42          .$2 =~ tr[\/*][]dr
43          ."\]\]"
44          .(defined($4) ? $4 =~ tr[\/*][]dr : '.')
45      /gemx;
46
47      print $voir;
48  }

Some context

  1. That code's job is to go through some 2500+ files, some of them several hundred lines long, find lines beginning with **Voir :** and insert DokuWiki links in them,
  2. I use Text::Undiacritic because those text files contain accented characters that I need to replace with their unaccented versions to build file names,
  3. The regex came first, then I searched for the best tool to apply it to many files, and that's how I ended up using Perl.
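For point 2, the idea of turning an accented term into a page name that is safe as a file name can be sketched as follows. This is not Text::Undiacritic: as an assumption for illustration, it swaps in the core module Unicode::Normalize (decompose with NFD, then delete combining marks), which may not give identical results for every character. The helper names strip_marks and page_name are hypothetical.

```perl
use strict;
use warnings;
use utf8;                             # the source below contains a literal é
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then drop combining marks (é becomes e).
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;
    return $decomposed;
}

# Hypothetical helper: lower-case, unaccented, spaces replaced by underscores.
sub page_name {
    my ($term) = @_;
    return lc(strip_marks($term) =~ tr/ /_/r);
}

my $name = page_name('solution hydrogénée');

# prints "glossaire:entrees:s:solution_hydrogenee"
print "glossaire:entrees:", substr($name, 0, 1), ":$name\n";
```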

Thanks for reading through !

Replies are listed 'Best First'.
Re: Unexpected line breaks in substitution results
by tybalt89 (Monsignor) on Apr 10, 2024 at 15:47 UTC

    It has to do with the \s.

    #!/usr/bin/perl
    use 5.36.1;
    use warnings;
    use strict;
    use utf8;
    use autodie;

    use warnings qw< FATAL utf8 >;
    use open qw< :std :utf8 >;
    use charnames qw< :full >;
    use feature qw< unicode_strings >;

    binmode(STDIN, ":utf8");
    binmode(STDOUT, ":utf8");
    binmode(STDERR, ":utf8");

    use Text::Undiacritic qw(undiacritic);

    $^I = ".bak";

    my $voir = '**Voir :** proton, solution hydrogénée' . "\n";

    $voir =~ s/
        (?:^\*\*Voir\s:\*\*
        |
        \G(?!^)
        (?!\[))
        \K
        (\s?)
        ((\w[\/*]*)
        (?:[^\.,;\n\r]\s?)+)
        [\/*]*([\.,;])?[\/*]*
    /
        "$1\[\[glossaire:entrees:"
        .lc(undiacritic($3))
        .":"
        .lc(undiacritic($2 =~ tr[ \/*][_]dr))
        ."|"
        .$2 =~ tr[\/*][]dr
        ."\]\]"
        .(defined($4) ? $4 =~ tr[\/*][]dr : '.')
    /gemx;

    print $voir, "\n";

    shows the line breaks, and

    #!/usr/bin/perl
    use 5.36.1;
    use warnings;
    use strict;
    use utf8;
    use autodie;

    use warnings qw< FATAL utf8 >;
    use open qw< :std :utf8 >;
    use charnames qw< :full >;
    use feature qw< unicode_strings >;

    binmode(STDIN, ":utf8");
    binmode(STDOUT, ":utf8");
    binmode(STDERR, ":utf8");

    use Text::Undiacritic qw(undiacritic);

    $^I = ".bak";

    my $voir = '**Voir :** proton, solution hydrogénée' . "\n";

    $voir =~ s/
        (?:^\*\*Voir\s:\*\*
        |
        \G(?!^)
        (?!\[))
        \K
        (\s?)
        ((\w[\/*]*)
        (?:[^\.,;\n\r]\h?)+)
        [\/*]*([\.,;])?[\/*]*
    /
        "$1\[\[glossaire:entrees:"
        .lc(undiacritic($3))
        .":"
        .lc(undiacritic($2 =~ tr[ \/*][_]dr))
        ."|"
        .$2 =~ tr[\/*][]dr
        ."\]\]"
        .(defined($4) ? $4 =~ tr[\/*][]dr : '.')
    /gemx;

    print $voir, "\n";

    doesn't.

    The \s is grabbing the new line => simple solution => chomp the input right after the while.
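The difference between the two variants can be reproduced without the full script. The sketch below uses a simplified stand-in for capture group 2 and unaccented ASCII test data (both are assumptions for brevity): \s matches any whitespace, including "\n", while \h matches only horizontal whitespace.

```perl
use strict;
use warnings;

# Simplified stand-ins for capture group 2 of the original pattern.
my $with_s = qr/(\w(?:[^.,;\n\r]\s?)+)/;   # \s? may swallow the trailing "\n"
my $with_h = qr/(\w(?:[^.,;\n\r]\h?)+)/;   # \h? matches horizontal space only

my $line = "solution hydrogenee\n";

my ($s) = $line =~ $with_s;
my ($h) = $line =~ $with_h;

printf "with \\s: %s\n", $s =~ /\n\z/ ? "newline captured" : "no newline";
printf "with \\h: %s\n", $h =~ /\n\z/ ? "newline captured" : "no newline";

# chomp removes the trailing newline before matching,
# so even \s? has nothing left to grab.
chomp(my $chomped = $line);
my ($c) = $chomped =~ $with_s;
printf "chomped, \\s: %s\n", $c =~ /\n\z/ ? "newline captured" : "no newline";
```

If the line is chomped in the original loop, the newline has to be added back on output, for example with print "$voir\n", otherwise the edited file loses its line endings.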

      Hey ! Thanks tybalt89.

      Indeed, when I uncommented the chomp that I had originally included and then removed (I don't remember why), the unwanted line breaks disappeared. I still don't understand why they appeared only when there was no final dot, but I guess it will become clear some day.
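As for why the breaks show up only when the final dot is missing, one likely reading of the pattern: when a terminator is present, \s? finds no whitespace after the last letter, the optional ([\.,;]) group consumes the dot, and the newline stays outside the match. Without a terminator, the last \s? is free to match the newline, which then becomes part of $2. A reduced sketch follows, with a simplified pattern, ASCII data and a hypothetical helper split_term:

```perl
use strict;
use warnings;

# Simplified tail of the original pattern:
# group 1 = the term, group 2 = the optional terminator.
my $tail = qr/(\w(?:[^.,;\n\r]\s?)+)([.,;])?/;

# Hypothetical helper returning (term, terminator) for one input line.
sub split_term {
    my ($line) = @_;
    return $line =~ $tail;
}

for my $line ("solution hydrogenee.\n", "solution hydrogenee\n") {
    my ($term, $mark) = split_term($line);
    my $shown = $term =~ s/\n/\\n/gr;      # make any captured newline visible
    printf "term=[%s] terminator=%s\n", $shown, $mark // 'undef';
}
# prints:
# term=[solution hydrogenee] terminator=.
# term=[solution hydrogenee\n] terminator=undef
```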
Re: Unexpected line breaks in substitution results
by Danny (Chaplain) on Apr 10, 2024 at 15:12 UTC
    This
    #!/usr/bin/env perl
    use 5.36.1;
    use warnings;
    use strict;
    use utf8;
    use autodie;

    use warnings qw< FATAL utf8 >;
    use open qw< :std :utf8 >;
    use charnames qw< :full >;
    use feature qw< unicode_strings >;

    binmode(STDIN, ":utf8");
    binmode(STDOUT, ":utf8");
    binmode(STDERR, ":utf8");

    use Text::Undiacritic qw(undiacritic);

    $^I = ".bak";

    my $voir = '**Voir :** proton, solution hydrogénée';

    $voir =~ s/
        (?:^\*\*Voir\s:\*\*
        |
        \G(?!^)
        (?!\[))
        \K
        (\s?)
        ((\w[\/*]*)
        (?:[^\.,;\n\r]\s?)+)
        [\/*]*([\.,;])?[\/*]*
    /
        "$1\[\[glossaire:entrees:"
        .lc(undiacritic($3))
        .":"
        .lc(undiacritic($2 =~ tr[ \/*][_]dr))
        ."|"
        .$2 =~ tr[\/*][]dr
        ."\]\]"
        .(defined($4) ? $4 =~ tr[\/*][]dr : '.')
    /gemx;

    print $voir, "\n";
    gives me
    **Voir :** [[glossaire:entrees:p:proton|proton]], [[glossaire:entrees:s:solution_hydrogenee|solution hydrogénée]].
    So not seeing the line break issue.