LexPl has asked for the wisdom of the Perl Monks concerning the following question:

I have got a highly complex, nested XML document in encoding ISO-8859-1 which contains abbreviations.

Each abbreviation has two to three letters, and each letter is directly followed by a full stop. The separator between one letter plus full stop and the next might be "!!!emsp14;", "!!!hairsp;", a whitespace character, or nothing at all.

You could define this as a regex: [a-zA-Z]\.((!!!emsp14;|!!!hairsp;|\s)?[a-zA-Z]\.)+

I would like to wrap each abbreviation in an <abbrev> element and unify the separator whitespace to "!!!hairsp;".

This looks pretty easy, but there are some nasty pitfalls:
If two abbreviations are adjacent to each other, the problem of proper segmentation pops up. For example the string "a. A. z. B." could lead to <abbrev>a.!!!hairsp;A.!!!hairsp;z.</abbrev> which doesn't exist. The correct solution would be <abbrev>a.!!!hairsp;A.</abbrev>_<abbrev>z.!!!hairsp;B.</abbrev> where the underscore stands for a space.

Another issue is the full stop at the end of a sentence and a following abbreviation:
"Hier müssen die richtigen Regeln einbezogen werden. Z.B. ist hier § 42 ...". Of course, there exists no abbreviation "n. Z.B.", but the proper tagging would be: "Hier müssen die richtigen Regeln einbezogen werden. <abbrev>Z.!!!hairsp;B.</abbrev> ist hier § 42 ...".

Because the regex captures abbreviations with both 2 and 3 letters, care has to be taken that a 3-letter abbreviation such as "m.w.N." isn't split into the 2-letter abbreviation "m.w." followed by "N.".

I suppose that you will need a kind of knowledge base in your script for the proper segmentation, but I don't know how to do that.

The easy solution would be a bunch of changes:

#!/usr/bin/perl
use warnings;
use strict;

# for interactive mode
my $infile  = $ARGV[0];
my $outfile = $ARGV[1];

open(IN,  '<' . $infile)  or die $!;
open(OUT, '>' . $outfile) or die $!;

while(<IN>) {
  # wrap "a.A."
  $_ =~ s[a\.!!!hairsp;A\.](<abbrev n='2'>a.!!!hairsp;A.</abbrev>)g;
  $_ =~ s[a\.!!!emsp14;A\.](<abbrev n='2'>a.!!!hairsp;A.</abbrev>)g;
  $_ =~ s[a\.\sA\.](<abbrev n='2'>a.!!!hairsp;A.</abbrev>)g;
  $_ =~ s[a\.A\.](<abbrev n='2'>a.!!!hairsp;A.</abbrev>)g;
  # wrap "a.F."
  $_ =~ s[a\.!!!hairsp;F\.](<abbrev n='2'>a.!!!hairsp;F.</abbrev>)g;
  $_ =~ s[a\.!!!emsp14;F\.](<abbrev n='2'>a.!!!hairsp;F.</abbrev>)g;
  $_ =~ s[a\.\sF\.](<abbrev n='2'>a.!!!hairsp;F.</abbrev>)g;
  $_ =~ s[a\.F\.](<abbrev n='2'>a.!!!hairsp;F.</abbrev>)g;
  # wrap "d.h."
  $_ =~ s[d\.!!!hairsp;h\.](<abbrev n='2'>d.!!!hairsp;h.</abbrev>)g;
  $_ =~ s[d\.!!!emsp14;h\.](<abbrev n='2'>d.!!!hairsp;h.</abbrev>)g;
  $_ =~ s[d\.\sh\.](<abbrev n='2'>d.!!!hairsp;h.</abbrev>)g;
  $_ =~ s[d\.h\.](<abbrev n='2'>d.!!!hairsp;h.</abbrev>)g;
  # wrap "D.h."
  $_ =~ s[D\.!!!hairsp;h\.](<abbrev n='2'>D.!!!hairsp;h.</abbrev>)g;
  $_ =~ s[D\.!!!emsp14;h\.](<abbrev n='2'>D.!!!hairsp;h.</abbrev>)g;
  $_ =~ s[D\.\sh\.](<abbrev n='2'>D.!!!hairsp;h.</abbrev>)g;
  $_ =~ s[D\.h\.](<abbrev n='2'>D.!!!hairsp;h.</abbrev>)g;
  print OUT $_;
}

close(IN);
close(OUT);

Do you see a more efficient solution? If so, could you kindly show me what it would look like?

Re: wrap abbreviations in XML element
by haukex (Archbishop) on May 16, 2025 at 10:33 UTC

    You have two separate problems:

    1. Do not use regular expressions to parse and manipulate XML or HTML, ever. See Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks.
    2. To help with your regex, we need a lot more examples of strings that should match and shouldn't match. See e.g. Re: How to ask better questions using Test::More and sample data

    For 1, here is some code that will walk all text nodes in an XML document, match a regex against that text, and wrap the matching text in a new tag. (Update: You didn't specify what should happen if there's a node in the match, e.g. "Z.<br/>B.", so the code below currently doesn't handle that case.)

    use warnings;
    use strict;
    use XML::LibXML;

    my $doc = XML::LibXML->load_xml(string => <<'XML');
    <foo>hello, foobar <bar x="foo">world! foo</bar> fo<quz/>oofoo</foo>
    XML

    my $re = qr/foo/;  # don't include any capturing parens in this!

    my @nodes = $doc->documentElement;
    while (my $node = pop @nodes) {
        for my $c ($node->childNodes) {
            if ($c->nodeType==XML_ELEMENT_NODE) { push @nodes, $c }
            elsif ($c->nodeType==XML_TEXT_NODE || $c->nodeType==XML_CDATA_SECTION_NODE) {
                my @parts = split /($re)/, $c->data;
                next unless @parts>1;
                my $d = $doc->createDocumentFragment;
                for my $i (0..$#parts) {
                    if ($i%2) {  # regex match
                        my $e = $doc->createElement('x');
                        $e->appendText($parts[$i]);
                        $d->appendChild($e);
                    }
                    else {  # text around match
                        $d->appendText($parts[$i]);
                    }
                }
                $node->replaceChild($d, $c);
            }
        }
    }
    print $doc->toString;
    __END__
    <?xml version="1.0"?>
    <foo>hello, <x>foo</x>bar <bar x="foo">world! <x>foo</x></bar> fo<quz/>oo<x>foo</x></foo>

    For 2, I would recommend that, just for the sake of explaining what you want to match and not match, you simplify the separators like !!!hairsp; into something shorter, like a single !, to make things easier for us to read. The pitfalls you describe are indeed complicated, and in the end you might need to use some existing data such as (just for example) this List of German abbreviations. Update: My node Building Regex Alternations Dynamically might be useful in that regard.
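
As a rough illustration of the "existing data" idea, here is a minimal sketch that builds one regex from a list of known abbreviations. The list @known and the helpers pattern_for, normalize and split_abbrevs are hypothetical names made up for this example; a real list would be much longer and probably loaded from a file. The resulting $re has no capturing groups, so it is meant to be usable as the $re in the tree-walking code above, applied to text nodes only.

```perl
use strict;
use warnings;

# Hypothetical knowledge base: each abbreviation given as its letters.
my @known = ( [qw(a A)], [qw(a F)], [qw(d h)], [qw(D h)],
              [qw(z B)], [qw(Z B)], [qw(m w N)] );

# Optional separator between "letter + full stop" units (non-capturing).
# "!!!name;" is the thread's placeholder for the entity "&name;".
my $sep = qr/(?:!!!emsp14;|!!!hairsp;|\s)?/;

# Pattern for one abbreviation: escaped "X." units joined by the separator.
sub pattern_for {
    my @letters = @_;
    return join $sep, map { quotemeta "$_." } @letters;
}

# Longest abbreviations first, so a 3-letter entry is tried before any
# 2-letter entry that could match its beginning.
my $alt = join '|', map { pattern_for(@$_) } sort { @$b <=> @$a } @known;

# The lookbehind stops a match from starting in the middle of a word.
my $re = qr/(?<![A-Za-z])(?:$alt)/;

# Rebuild a matched abbreviation with the unified separator.
sub normalize {
    my ($abbr) = @_;
    my @letters = $abbr =~ /([A-Za-z])\./g;
    return join '!!!hairsp;', map { "$_." } @letters;
}

# Split text into alternating plain-text / abbreviation parts (odd indexes
# are matches, as in the tree-walking code) and normalize the matches.
sub split_abbrevs {
    my ($text) = @_;
    my @parts = split /($re)/, $text;
    for my $i (grep { $_ % 2 } 0 .. $#parts) {
        $parts[$i] = normalize($parts[$i]);
    }
    return @parts;
}

print join('|', split_abbrevs('a. A. z. B.')), "\n";
print join('|', split_abbrevs('werden. Z.B. ist hier')), "\n";
```

Because only listed abbreviations can match, "A. z." and "n. Z.B." are never candidates, which is how the list takes care of the segmentation pitfalls from the question. Whether a plain list is enough for real data is something only more test cases can show.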

      Many thanks for your detailed feedback, which I will consume bit by bit :)

      In a short test with slight adaptations to your script:

      1. my $doc = XML::LibXML->load_xml(string => <<'XML');
         <foo>hello, foobar <bar x="foo">world! a.!!!emsp14;A.</bar> fo<quz/>ooa.!!!hairsp;A.</foo>
         XML
      2. my $re = qr/a\.(!!!emsp14;|!!!hairsp;)A\./;
      3. my $e = $doc->createElement('abbrev');

      I got the following output:

      <?xml version="1.0"?>
      <foo>hello, foobar <bar x="foo">world! <abbrev>a.!!!emsp14;A.</abbrev>!!!emsp14;</bar> fo<quz/>oo<abbrev>a.!!!hairsp;A.</abbrev>!!!hairsp;</foo>

      As you will see, the separating whitespace is repeated after the "abbrev" element.

      Why do I use "!!!emsp14;" for the entity &emsp14;? I want to prevent such entities from being resolved by an XML parser, and I want to manipulate my data independently of a DTD.
        As you will see, the separating whitespace is repeated after the "abbrev" element.

        Yes, that's because of how split behaves when the regular expression contains capturing groups: each group's captured text is returned as an additional list element. That's why I wrote "don't include any capturing parens in this!". So if you change the capturing group in your regular expression into a non-capturing group, it'll work as expected: qr/a\.(?:!!!emsp14;|!!!hairsp;)A\./ See also perlretut.
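
The difference can be seen in isolation with a small sketch. The sample string and the variable names are made up for this demonstration; the two patterns are the ones from this subthread, each wrapped in the outer capturing parens that the tree-walking code adds.

```perl
use strict;
use warnings;

my $text = 'x a.!!!emsp14;A. y';

# Inner capturing group: split returns the whole match AND the inner capture.
my @with_capture = split /(a\.(!!!emsp14;|!!!hairsp;)A\.)/, $text;

# Inner group made non-capturing: only the whole match is returned.
my @without_capture = split /(a\.(?:!!!emsp14;|!!!hairsp;)A\.)/, $text;

print scalar(@with_capture),    " parts: ", join(' | ', @with_capture),    "\n";
print scalar(@without_capture), " parts: ", join(' | ', @without_capture), "\n";
```

With the extra capture, the separator shows up as its own list element at an even index, so the loop treats it as plain text and appends it after the new element.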