LexPl has asked for the wisdom of the Perl Monks concerning the following question:
I have XML entities that sometimes follow each other directly. In some cases they are separated by a space, in other cases they aren't. Which alternative is correct depends on the two entities involved.
You might find "... § 9 ...", but also "... – Übertragung ..."
The following script lists the regexes that matched in the input file. Ideally, I would instead get the actual occurrences of the entities that match a generic regex such as ;\s&, for example §  or öß, so that I can detect which combinations of entities exist. That's what I mean by "minimal occurrences".
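One way to list the actual entity pairs instead of the regex could be sketched as follows. This is only an illustration under assumptions: the helper name `spaced_entity_pairs`, the entity pattern `&#?\w+;`, and the sample string are made up for the example and are not part of the script in question.

```perl
use strict;
use warnings;

# Hypothetical helper: count each distinct pair of entities that is
# separated by exactly one whitespace character.
sub spaced_entity_pairs {
    my ($xml) = @_;
    my %count;
    # Assumption: an entity looks like &name; or &#123; or &#x1F;
    my $entity = qr/&#?\w+;/;
    # The lookahead captures the second entity without consuming it,
    # so it can still serve as the first entity of the next pair.
    while ($xml =~ /($entity)\s(?=($entity))/g) {
        # The key always uses a plain space, whatever the whitespace was.
        $count{"$1 $2"}++;
    }
    return \%count;
}

# Made-up sample input, for illustration only.
my $sample = 'see &sect; &nbsp;9 and &ndash; &Uuml;bertragung, again &sect; &nbsp;10';
my $pairs  = spaced_entity_pairs($sample);
print "$_\t$pairs->{$_}\n" for sort keys %$pairs;
```

Because the pairs are collected in a hash, each combination is reported once together with its count, which may be close to the "minimal occurrences" described above.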
I would also like to handle original files in a different manner than modified ones:
In original files, I'm looking for spaces between two entities (regex: /;\s&/) to find out how many occurrences the original file contains, so that I can check whether any of these spaces have been lost in the modified files. Here it would be nice to see which combinations exist in a given file.
In modified files, I use regexes to find the missing-space cases that I have already identified.
To switch between these two control flows, I tried an if/else expression with a different my @regexes in each branch, but that didn't work:
if ($param eq 'mod') { my @regexes...} else { my @regexes...}
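A likely reason for the failure is lexical scoping: a `my @regexes` declared inside a block ceases to exist when that block ends, so code after the if/else cannot see it. A minimal sketch of one way around this, assuming the mode arrives as the second command-line argument; the helper name `regexes_for` and the entity spellings in the patterns are assumptions made for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: return the regex list for the given mode.
sub regexes_for {
    my ($param) = @_;
    # Declare the array once, outside the branches, so that it is
    # still in scope after the if/else.
    my @regexes;
    if (defined $param and $param eq 'mod') {
        # modified files: known cases of a lost space (assumed spellings)
        @regexes = (qr/&ndash;&sect;/, qr/&ndash;&Uuml;/, qr/&szlig;&sect;/);
    }
    else {
        # original files: any two entities separated by whitespace
        @regexes = (qr/;\s&/);
    }
    return @regexes;
}

# Assumption: the mode is passed as the second command-line argument.
my @regexes = regexes_for($ARGV[1]);
print scalar(@regexes), " regex(es) selected\n";
```

The branches only assign to the array; the single declaration lives in the enclosing scope, so the list is still available afterwards.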
Here's the script I'm referring to:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use diagnostics;

    print "Find pair of entities without/with separating space\n";
    my $infile = $ARGV[0];

    # put in comments what is not relevant
    # for modified files: currently known cases
    # my @regexes = (qr/&ndash;&sect;/, qr/&ndash;&Uuml;/, qr/&szlig;&sect;/);
    # for original files: check for entities separated by space
    my @regexes = (qr/;\s&/);

    open my $in, '<', $infile or die "Cannot open $infile for reading: $!";

    #read input file in variable $xml
    my $xml;
    {
        local $/ = undef;
        $xml = <$in>;
    }

    #define output file
    open my $out, '>', 'pairs.txt' or die $!;

    print {$out} "Find pair of entities without/with separating space\n\ninput file: ";
    print {$out} "$infile";
    print {$out} "\n========================================================================\n\n";

    for my $i (0 .. $#regexes) {
        my $regex = $regexes[$i];
        $regex =~ s/^\(\?\^://;
        $regex =~ s/\)$//;
        print {$out} "$regex\n" while $xml =~ /$regex/g;
    }

    close $in;
    close $out;
Replies are listed 'Best First'.

Re: Output minimal occurrences of matching regex(es)
by choroba (Cardinal) on Nov 13, 2024 at 21:39 UTC
by LexPl (Beadle) on Nov 14, 2024 at 11:16 UTC
by hippo (Archbishop) on Nov 14, 2024 at 11:49 UTC