Output minimal occurrences of matching regex(es)

LexPl has asked for the wisdom of the Perl Monks concerning the following question:

I have XML entities which sometimes follow directly after each other. In some cases they are separated by a space, in other cases they aren't. Which of these alternatives is correct depends on the two entities involved.

You might find "... &sect;&emsp14;9 ...", but also "... &ndash; &Uuml;bertragung ..."

The following script lists the matching regexes that were found in the input file. Ideally, I would get the actual occurrences of the entities matching a generic regex such as ;\s& (for example &sect;&emsp14; or &ouml;&szlig;), so that I can detect the existing combinations of entities. That's what I mean by "minimal occurrences".

I would also like to handle original files in a different manner than modified ones:

In original files, I'm looking for spaces between two entities, i.e. the regex ;\s&, to find out how many occurrences there are in the original file, so that I can check whether any of these spaces have been lost in the modified files. Here it would be nice to see which combinations exist in a given file.

In modified files, I use regexes to find issues of missing space that I have already recognized.

To switch between these two control flows, I have tried an

if ($param eq 'mod') {
    my @regexes...}
else {
    my @regexes...}

expression for different my @regexes, but that didn't work.

Here's my script I'm referring to:

#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;

print "Find pair of entities without/with separating space\n";

my $infile = $ARGV[0];

# put in comments what is not relevant
# for modified files: currently known cases
# my @regexes = (qr/&ndash;&sect;/, qr/&ndash;&Uuml;/, qr/&szlig;&sect;/);
# for original files: check for entities separated by space
my @regexes = (qr/;\s&/);

open my $in, '<', $infile or die "Cannot open $infile for reading: $!";

#read input file in variable $xml
my $xml;
{
  local $/ = undef;
  $xml = <$in>;
}

#define output file
open my $out, '>', 'pairs.txt' or die $!;

print {$out} "Find pair of entities without/with separating space\n\ninput file: ";
print {$out} "$infile";
print {$out} "\n========================================================================\n\n";

for my $i (0 .. $#regexes) {
    my $regex = $regexes[$i];
    $regex =~ s/^\(\?\^://;
    $regex =~ s/\)$//;
    print {$out} "$regex\n" while $xml =~ /$regex/g;
}

close $in;
close $out;

Re: Output minimal occurrences of matching regex(es) by choroba (Cardinal) on Nov 13, 2024 at 21:39 UTC
You cannot declare the two @regexes arrays in different scopes and then use them later. A variable declared with my only lasts up to the end of the enclosing block. You can declare the variable before the condition and only populate it depending on it:

my @regexes;
if ($param eq 'mod') { @regexes = ... } else { @regexes = ... }

or you can use the conditional operator (the "ternary"), shown below.

To match the entities involved in ;\s&, you need to extend the regex:

&[^;]+;\s&[^;]+;

i.e. an ampersand, not semicolon at least once, semicolon, space, and an entity again.

Here's a complete example. Running it with script.pl 1.xml searches for the original regexes, specifying any true parameter (e.g. script.pl 1.xml 1) after the file uses the modified regexes. I also added the grouping parentheses to the matching and $1 to the output to show which part of the string actually matched the regex.

#!/usr/bin/perl
use warnings;
use strict;

my ($infile, $param) = @ARGV;

my @regexes = $param ? (qr/&[^;]+;\s&[^;]+;/)
                     : (qr/&ndash;&sect;/, qr/&ndash;&Uuml;/, qr/&szlig;&sect;/);

open my $in, '<', $infile or die "Cannot open $infile for reading: $!";

my $xml;
{
    local $/;
    $xml = <$in>;
}

open my $out, '>', 'pairs.txt' or die $!;

print {$out} "Find pair of entities without/with separating space\n\ninput file: ";
print {$out} "$infile";
print {$out} "\n========================================================================\n\n";

for my $i (0 .. $#regexes) {
    my $regex = $regexes[$i];
    $regex =~ s/^\(\?\^://;
    $regex =~ s/\)$//;
    print {$out} "$regex: $1\n" while $xml =~ /($regex)/g;
}

close $in;
close $out;

Update: Fixed formatting, thanks Discipulus.
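As an illustration of the scoping point, here is a minimal sketch of the "declare first, populate later" variant. The sub name regexes_for and the 'mod' switch value are assumptions made for this example only; the regexes are the ones discussed in this thread.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper (not part of the original script): returns the
# list of regexes for the given mode.
sub regexes_for {
    my ($mode) = @_;
    my @regexes;    # declared once, outside both branches
    if (defined $mode && $mode eq 'mod') {
        # known problem cases in modified files
        @regexes = (qr/&ndash;&sect;/, qr/&ndash;&Uuml;/, qr/&szlig;&sect;/);
    }
    else {
        # original files: two entities separated by whitespace
        @regexes = (qr/&[^;]+;\s&[^;]+;/);
    }
    return @regexes;    # still in scope here, unlike a "my" inside the branches
}

my @mod  = regexes_for('mod');
my @orig = regexes_for();
print scalar(@mod),  " regexes for modified files\n";
print scalar(@orig), " regex for original files\n";
```

Had @regexes been declared with my inside each branch, both arrays would have gone out of scope at the closing brace, and under use strict the later use of @regexes would fail to compile.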
Re^2: Output minimal occurrences of matching regex(es) by LexPl (Beadle) on Nov 14, 2024 at 11:16 UTC
First of all, many thanks for your valuable input! I learn a lot from your comments :-)

I understood that my ($infile, $param) = @ARGV; is a smart way to read two input parameters from the command line. What happens if the second parameter is missing, i.e. perl entityPairs.pl input.xml? Will that mean that the condition in the ternary operator expression evaluates to false and the values after ":" are taken?

I've got two issues:

The output only shows $regex, but not $1 (the current value), so for example &[^;]+;\s&[^;]+;: for each occurrence in the original file and for example &ndash;&sect;: in the modified file.

There is a warning with respect to the line that prints the occurrences.

print {$out} "$regex: $1\n" while $xml =~ /$regex/g;

Use of uninitialized value $1 in concatenation (.) or string at entityPairs.pl line 42, <$in> line 1 (#1)
(W uninitialized) An undefined value was used as if it were already defined. It was interpreted as a "" or a 0, but maybe it was a mistake. To suppress this warning assign a defined value to your variables.

To help you figure out what was undefined, perl will try to tell you the name of the variable (if any) that was undefined. In some cases it cannot do this, so it also tells you what operation you used the undefined value in. Note, however, that perl optimizes your program and the operation displayed in the warning may not necessarily appear literally in your program. For example, "that $foo" is usually optimized into "that " . $foo, and the warning will refer to the concatenation (.) operator, even though there is no . in your program.

FYI, the current version of the script:

#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;

print "Find pair of entities without/with separating space\n";

# read input file and param ('mod' = for modified files)
my ($infile, $param) = @ARGV;
my @regexes = $param ? (qr/&ndash;&sect;/, qr/&ndash;&Uuml;/, qr/&szlig;&sect;/) : (qr/&[^;]+;\s&[^;]+;/);

open my $in, '<', $infile or die "Cannot open $infile for reading: $!";

#read input file in variable $xml
my $xml;
{
  local $/ = undef;
  $xml = <$in>;
}

#define output file
open my $out, '>', 'pairs.txt' or die $!;

#output statistics
print {$out} "Find pair of entities without/with separating space\n\ninput file: ";
print {$out} "$infile";
print {$out} "\n========================================================================\n\n";

for my $i (0 .. $#regexes) {
    my $regex = $regexes[$i];
    $regex =~ s/^\(\?\^://;
    $regex =~ s/\)$//;
    print {$out} "$regex: $1\n" while $xml =~ /$regex/g;
}

close $in;
close $out;
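On the question about a missing second parameter: a list assignment such as my ($infile, $param) = @ARGV; leaves $param undefined when only one argument is given, and undef is false in a boolean test, so the branch after ":" is taken. The sketch below uses a hypothetical sub mode_for with @_ standing in for @ARGV so the behaviour can be tried without the command line.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper that mimics the ternary from the script;
# @_ plays the role of @ARGV here.
sub mode_for {
    my ($infile, $param) = @_;    # $param stays undef if only one value is passed
    return $param ? 'modified' : 'original';
}

print mode_for('input.xml'), "\n";           # prints "original"
print mode_for('input.xml', 'mod'), "\n";    # prints "modified"
print mode_for('input.xml', 0), "\n";        # prints "original": 0 is false, too
```

Testing an undefined value for truth does not trigger the "uninitialized" warning. Note that any true value selects the first branch, so the script as posted does not actually check for the string 'mod'.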
Re^3: Output minimal occurrences of matching regex(es) by hippo (Archbishop) on Nov 14, 2024 at 11:49 UTC
"I've got two issues:"

Actually, you have only one issue with 2 symptoms. Both symptoms are because you are not capturing the results of the match. Use brackets to do that. eg:

$ perl -we '$regex = qr/foo/; print "$regex: $1\n" while "foob" =~ /$regex/g;'
Use of uninitialized value $1 in concatenation (.) or string at -e line 1.
(?^:foo): 
$ perl -we '$regex = qr/(foo)/; print "$regex: $1\n" while "foob" =~ /$regex/g;'
(?^:(foo)): foo
$
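Building on the capturing advice, the combinations that occur in a file could be tallied in a hash keyed by the captured text. This is only a sketch: the sample string is made up, and in the real script $xml would hold the slurped input file.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Made-up sample input standing in for the slurped XML file.
my $xml = 'x &ndash; &Uuml;ber y &ndash; &Uuml;bung z &sect; &emsp14;9';

my %count;
# The parentheses capture the whole entity pair into $1.
while ($xml =~ /(&[^;]+;\s&[^;]+;)/g) {
    $count{$1}++;
}

# One line per distinct combination, with its number of occurrences.
print "$_: $count{$_}\n" for sort keys %count;
```

With this sample the output is "&ndash; &Uuml;: 2" followed by "&sect; &emsp14;: 1". One caveat of the pattern: [^;]+ also accepts a stray ampersand that does not start an entity, so a stricter character class may be needed for real data.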