I have got a highly complex, nested XML document in encoding ISO-8859-1 which contains abbreviations.

Each abbreviation has two to three letters, and each letter is directly followed by a full stop. The separator between each letter plus full stop might be "!!!emsp14;", "!!!hairsp;", a whitespace character, or nothing at all.

You could define this as a regex: [a-zA-Z]\.((!!!emsp14;|!!!hairsp;|\s)?[a-zA-Z]\.){1,2}

I would like to wrap each abbreviation in an <abbrev> element and normalize the separator to "!!!hairsp;".
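A minimal sketch of that wrapping step, assuming the pattern above is read with character classes ([a-zA-Z]) and one or two repetitions. The names wrap_naive and normalize are hypothetical helpers, not part of any existing script, and the match is purely pattern-based, with no knowledge of which abbreviations actually exist.

```perl
use strict;
use warnings;

# One "letter + full stop" unit.
my $unit = qr/[a-zA-Z]\./;

# Allowed separators between units; the trailing "?" also permits none.
my $sep = qr/(?:!!!emsp14;|!!!hairsp;|\s)?/;

# Two or three units in a row.
my $abbrev = qr/$unit(?:$sep$unit){1,2}/;

# Hypothetical helper: rebuild a match with "!!!hairsp;" between the units
# and wrap it in <abbrev>.
sub normalize {
    my ($match) = @_;
    my @letters = $match =~ /([a-zA-Z])\./g;
    return '<abbrev>' . join('!!!hairsp;', map { $_ . '.' } @letters) . '</abbrev>';
}

# Hypothetical helper: wrap every pattern match in the given text.
sub wrap_naive {
    my ($text) = @_;
    $text =~ s{($abbrev)}{ normalize($1) }ge;
    return $text;
}

print wrap_naive("Das gilt z. B. hier"), "\n";
```

Because the separators are rebuilt from the letters, all four input variants (entity, other entity, whitespace, nothing) end up in the same normalized form.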

This looks pretty easy, but there are some nasty pitfalls:
If two abbreviations are adjacent to each other, the problem of proper segmentation pops up. For example the string "a. A. z. B." could lead to <abbrev>a.!!!hairsp;A.!!!hairsp;z.</abbrev> which doesn't exist. The correct solution would be <abbrev>a.!!!hairsp;A.</abbrev>_<abbrev>z.!!!hairsp;B.</abbrev> where the underscore stands for a space.

Another issue is the full stop at the end of a sentence and a following abbreviation:
"Hier müssen die richtigen Regeln einbezogen werden. Z.B. ist hier § 42 ...". Of course, there exists no abbreviation "n. Z.B.", but the proper tagging would be: "Hier müssen die richtigen Regeln einbezogen werden. <abbrev>Z.!!!hairsp;B.</abbrev> ist hier § 42 ...".

As the regex captures both two-letter and three-letter abbreviations, care must be taken that a three-letter abbreviation such as "m.w.N." is not split into the two-letter abbreviation "m.w." followed by "N."

I suppose that you will need a kind of knowledge base in your script for the proper segmentation, but I don't know how to do that.
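One way such a knowledge base could look is sketched below. It assumes a hard-coded list of known abbreviations (here only the examples mentioned in this post), and the names @known, entry_to_pattern, tag and wrap_known are hypothetical. The list is turned into a single alternation, longest entries first, and a negative lookbehind keeps a match from starting directly after a letter.

```perl
use strict;
use warnings;

# Hypothetical knowledge base, written without separators.
# A real list would be longer and could be read from a file.
my @known = ('a.A.', 'a.F.', 'd.h.', 'D.h.', 'z.B.', 'Z.B.', 'm.w.N.');

# Allowed separators between the units; the trailing "?" also permits none.
my $sep = '(?:!!!emsp14;|!!!hairsp;|\s)?';

# Turn "z.B." into the pattern source  z\. <sep> B\.
sub entry_to_pattern {
    my ($entry) = @_;
    my @letters = $entry =~ /([a-zA-Z])\./g;
    return join $sep, map { $_ . '\.' } @letters;
}

# Longest entries first, so that "m.w.N." is tried before any shorter
# entry with the same prefix that the list might also contain.
my $alternation = join '|',
    map  { entry_to_pattern($_) }
    sort { length($b) <=> length($a) } @known;

# The lookbehind rejects a match that starts right after a letter.
my $abbrev_re = qr/(?<![a-zA-Z])($alternation)/;

# Rebuild the match with "!!!hairsp;" between the units and wrap it.
sub tag {
    my ($match) = @_;
    my @letters = $match =~ /([a-zA-Z])\./g;
    return '<abbrev>' . join('!!!hairsp;', map { $_ . '.' } @letters) . '</abbrev>';
}

sub wrap_known {
    my ($text) = @_;
    $text =~ s{$abbrev_re}{ tag($1) }ge;
    return $text;
}

print wrap_known("a. A. z. B."), "\n";
print wrap_known("werden. Z.B. ist hier"), "\n";
print wrap_known("m.w.N."), "\n";
```

Only listed abbreviations are wrapped, so "a. A. z. B." is segmented by the list rather than by the pattern, and a sentence-final "n." is never pulled into a match. Like the line-based approach, this works on raw text and does not distinguish element content from tag or attribute text.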

The easy solution would be a long list of hard-coded substitutions:

#!/usr/bin/perl
use warnings;
use strict;

# for interactive mode
my $infile  = $ARGV[0];
my $outfile = $ARGV[1];

# three-argument open with lexical filehandles
open(my $in,  '<', $infile)  or die "Cannot read $infile: $!";
open(my $out, '>', $outfile) or die "Cannot write $outfile: $!";

while (<$in>) {
    # wrap "a.A."
    $_ =~ s[a\.!!!hairsp;A\.](<abbrev n='2'>a.!!!hairsp;A.</abbrev>)g;
    $_ =~ s[a\.!!!emsp14;A\.](<abbrev n='2'>a.!!!hairsp;A.</abbrev>)g;
    $_ =~ s[a\.\sA\.](<abbrev n='2'>a.!!!hairsp;A.</abbrev>)g;
    $_ =~ s[a\.A\.](<abbrev n='2'>a.!!!hairsp;A.</abbrev>)g;

    # wrap "a.F."
    $_ =~ s[a\.!!!hairsp;F\.](<abbrev n='2'>a.!!!hairsp;F.</abbrev>)g;
    $_ =~ s[a\.!!!emsp14;F\.](<abbrev n='2'>a.!!!hairsp;F.</abbrev>)g;
    $_ =~ s[a\.\sF\.](<abbrev n='2'>a.!!!hairsp;F.</abbrev>)g;
    $_ =~ s[a\.F\.](<abbrev n='2'>a.!!!hairsp;F.</abbrev>)g;

    # wrap "d.h."
    $_ =~ s[d\.!!!hairsp;h\.](<abbrev n='2'>d.!!!hairsp;h.</abbrev>)g;
    $_ =~ s[d\.!!!emsp14;h\.](<abbrev n='2'>d.!!!hairsp;h.</abbrev>)g;
    $_ =~ s[d\.\sh\.](<abbrev n='2'>d.!!!hairsp;h.</abbrev>)g;
    $_ =~ s[d\.h\.](<abbrev n='2'>d.!!!hairsp;h.</abbrev>)g;

    # wrap "D.h."
    $_ =~ s[D\.!!!hairsp;h\.](<abbrev n='2'>D.!!!hairsp;h.</abbrev>)g;
    $_ =~ s[D\.!!!emsp14;h\.](<abbrev n='2'>D.!!!hairsp;h.</abbrev>)g;
    $_ =~ s[D\.\sh\.](<abbrev n='2'>D.!!!hairsp;h.</abbrev>)g;
    $_ =~ s[D\.h\.](<abbrev n='2'>D.!!!hairsp;h.</abbrev>)g;

    print $out $_;
}

close($in);
close($out);

Do you see a more efficient solution? If so, could you kindly show me what it would look like?


In reply to wrap abbreviations in XML element by LexPl
