LexPl has asked for the wisdom of the Perl Monks concerning the following question:
I have got a highly complex, nested XML document in encoding ISO-8859-1 which contains abbreviations.
Each abbreviation has two to three letters, and each letter is directly followed by a full stop. The separator between each "letter plus full stop" unit might be "!!!emsp14;", "!!!hairsp;", a whitespace character, or nothing at all.
You could define this as a regex: [a-zA-Z]\.((!!!emsp14;|!!!hairsp;|\s)?[a-zA-Z]\.)+
I would like to wrap each abbreviation into an element <abbrev> and unify the separator whitespace to "!!!hairsp;"
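To make the starting point concrete, here is a minimal sketch of the purely generic approach: match any two or three "letter plus full stop" units and rewrite the separators. The names `wrap_generic` and `join_units` are made up for illustration, and the two-to-three limit is taken from the description above. It is deliberately naive.

```perl
use strict;
use warnings;

# One "letter + full stop" unit and the optional separator between units.
# '!!!emsp14;' and '!!!hairsp;' are copied literally from the post.
my $LETTER = qr/[a-zA-Z]\./;
my $SEP    = qr/(?:!!!emsp14;|!!!hairsp;|\s)?/;

# Two or three units in a row.
my $CANDIDATE = qr/$LETTER(?:$SEP$LETTER){1,2}/;

# Hypothetical helper: rebuild a match with '!!!hairsp;' between the units.
sub join_units {
    my ($match) = @_;
    my @letters = $match =~ /([a-zA-Z])\./g;
    return join '!!!hairsp;', map { $_ . '.' } @letters;
}

# Hypothetical helper: wrap every candidate, whether or not it is a real abbreviation.
sub wrap_generic {
    my ($text) = @_;
    $text =~ s/($CANDIDATE)/'<abbrev>' . join_units($1) . '<\/abbrev>'/ge;
    return $text;
}

print wrap_generic('d. h.'), "\n";        # <abbrev>d.!!!hairsp;h.</abbrev>
print wrap_generic('a. A. z. B.'), "\n";  # <abbrev>a.!!!hairsp;A.!!!hairsp;z.</abbrev> B.
```

The second call shows why a generic pattern alone is not enough: it greedily takes three units and has no idea which combinations are real abbreviations.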
This looks pretty easy, but there are some nasty pitfalls:
If two abbreviations are adjacent to each other, the problem of proper segmentation pops up. For example the string "a. A. z. B." could lead to <abbrev>a.!!!hairsp;A.!!!hairsp;z.</abbrev> which doesn't exist. The correct solution would be <abbrev>a.!!!hairsp;A.</abbrev>_<abbrev>z.!!!hairsp;B.</abbrev> where the underscore stands for a space.
Another issue is the full stop at the end of a sentence and a following abbreviation:
"Hier müssen die richtigen Regeln einbezogen werden. Z.B. ist hier § 42 ...". Of course, there exists no abbreviation "n. Z.B.", but the proper tagging would be: "Hier müssen die richtigen Regeln einbezogen werden. <abbrev>Z.!!!hairsp;B.</abbrev> ist hier § 42 ...".
As the regex captures abbreviations with two letters as well as three letters, care must be taken that a three-letter abbreviation such as "m.w.N." is not split into a two-letter abbreviation "m.w." followed by "N."
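One detail worth knowing here: in a Perl alternation the first alternative that matches wins, not the longest one. The sketch below assumes a hypothetical list that contains both "m.w." and "m.w.N." and shows how sorting by length changes the result. The helper names `alternation` and `longest_first` are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical list: a short entry that is a prefix of a longer one.
my @known = ('m.w.', 'm.w.N.');

# Build a capturing alternation from the items, in the order given.
sub alternation {
    my @items = @_;
    my $alt = join '|', map { quotemeta($_) } @items;
    return qr/($alt)/;
}

# Longest entries first, ties broken alphabetically for a stable order.
sub longest_first {
    return sort { length($b) <=> length($a) or $a cmp $b } @_;
}

my $text = 'm.w.N.';
my ($naive)  = $text =~ alternation(@known);
my ($sorted) = $text =~ alternation(longest_first(@known));

print "$naive\n";   # m.w.
print "$sorted\n";  # m.w.N.
```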
I suppose that you will need a kind of knowledge base in your script for the proper segmentation, but I don't know how to do that.
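One way such a knowledge base could look is a plain list of known abbreviations from which a single regex is generated. This is only a sketch under assumptions: the list `@KNOWN` is a made-up sample (extend it with the abbreviations that actually occur in the document), the helper names are invented, and the lookbehind that forbids a letter directly before a match is my own guard, not something stated in the post.

```perl
use strict;
use warnings;

# Hypothetical knowledge base, written without separators.
my @KNOWN = ('a.A.', 'a.F.', 'd.h.', 'D.h.', 'z.B.', 'Z.B.', 'm.w.N.');

# Optional separator between units; entity names copied from the post.
my $SEP = qr/(?:!!!emsp14;|!!!hairsp;|\s)?/;

# Turn 'm.w.N.' into a pattern that allows a separator between its units.
sub unit_pattern {
    my ($abbr) = @_;
    my @letters = $abbr =~ /([a-zA-Z])\./g;
    return join $SEP, map { quotemeta($_ . '.') } @letters;
}

# Longest entries first, so a long abbreviation beats any shorter prefix.
my $ALT = join '|', map { unit_pattern($_) }
          sort { length($b) <=> length($a) or $a cmp $b } @KNOWN;

# Assumption: a match must not start directly after a letter.
my $ABBREV = qr/(?<![a-zA-Z])($ALT)/;

# Rebuild a match with the unified separator and wrap it.
sub tag {
    my ($match) = @_;
    my @letters = $match =~ /([a-zA-Z])\./g;
    return '<abbrev>' . join('!!!hairsp;', map { $_ . '.' } @letters) . '</abbrev>';
}

sub wrap_abbrevs {
    my ($text) = @_;
    $text =~ s/$ABBREV/tag($1)/ge;
    return $text;
}

print wrap_abbrevs('a. A. z. B.'), "\n";
# <abbrev>a.!!!hairsp;A.</abbrev> <abbrev>z.!!!hairsp;B.</abbrev>
print wrap_abbrevs('werden. Z.B. ist hier'), "\n";
# werden. <abbrev>Z.!!!hairsp;B.</abbrev> ist hier
```

Because only listed abbreviations can match, "a. A. z." and "n. Z.B." are never candidates, and each input line needs one substitution instead of four per abbreviation. Adding an attribute such as n='2' would be a small change inside `tag`.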
The easy solution would be a bunch of changes:
    #!/usr/bin/perl
    use warnings;
    use strict;

    # for interactive mode
    my $infile  = $ARGV[0];
    my $outfile = $ARGV[1];

    open(IN, '<' . $infile) or die $!;
    open(OUT, '>' . $outfile) or die $!;

    while(<IN>) {
        # wrap "a.A."
        $_ =~ s[a\.!!!hairsp;A\.](<abbrev n='2'>a.!!!hairsp;A.</abbrev>)g;
        $_ =~ s[a\.!!!emsp14;A\.](<abbrev n='2'>a.!!!hairsp;A.</abbrev>)g;
        $_ =~ s[a\.\sA\.](<abbrev n='2'>a.!!!hairsp;A.</abbrev>)g;
        $_ =~ s[a\.A\.](<abbrev n='2'>a.!!!hairsp;A.</abbrev>)g;

        # wrap "a.F."
        $_ =~ s[a\.!!!hairsp;F\.](<abbrev n='2'>a.!!!hairsp;F.</abbrev>)g;
        $_ =~ s[a\.!!!emsp14;F\.](<abbrev n='2'>a.!!!hairsp;F.</abbrev>)g;
        $_ =~ s[a\.\sF\.](<abbrev n='2'>a.!!!hairsp;F.</abbrev>)g;
        $_ =~ s[a\.F\.](<abbrev n='2'>a.!!!hairsp;F.</abbrev>)g;

        # wrap "d.h."
        $_ =~ s[d\.!!!hairsp;h\.](<abbrev n='2'>d.!!!hairsp;h.</abbrev>)g;
        $_ =~ s[d\.!!!emsp14;h\.](<abbrev n='2'>d.!!!hairsp;h.</abbrev>)g;
        $_ =~ s[d\.\sh\.](<abbrev n='2'>d.!!!hairsp;h.</abbrev>)g;
        $_ =~ s[d\.h\.](<abbrev n='2'>d.!!!hairsp;h.</abbrev>)g;

        # wrap "D.h."
        $_ =~ s[D\.!!!hairsp;h\.](<abbrev n='2'>D.!!!hairsp;h.</abbrev>)g;
        $_ =~ s[D\.!!!emsp14;h\.](<abbrev n='2'>D.!!!hairsp;h.</abbrev>)g;
        $_ =~ s[D\.\sh\.](<abbrev n='2'>D.!!!hairsp;h.</abbrev>)g;
        $_ =~ s[D\.h\.](<abbrev n='2'>D.!!!hairsp;h.</abbrev>)g;

        print OUT $_;
    }

    close(IN);
    close(OUT);
Do you see a more efficient solution? And if yes, could you kindly show me what it would look like?
Re: wrap abbreviations in XML element
by haukex (Archbishop) on May 16, 2025 at 10:33 UTC
by LexPl (Beadle) on May 16, 2025 at 12:10 UTC
by haukex (Archbishop) on May 16, 2025 at 12:48 UTC