Re: wrap abbreviations in XML element

You have two separate problems:

1. Do not use regular expressions to parse and manipulate XML or HTML, ever. See: Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks
2. To help with your regex, we need a lot more examples of strings that should and shouldn't match. See e.g. Re: How to ask better questions using Test::More and sample data

For 1, here is some code that will walk all text nodes in an XML document, match a regex against that text, and wrap the matching text in a new tag. (Update: You didn't specify what should happen if there's an element node inside the match, e.g. "Z.<br/>B.", so the code below currently doesn't handle that case.)

use warnings;
use strict;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(string => <<'XML');
<foo>hello, foobar <bar x="foo">world! foo</bar> fo<quz/>oofoo</foo>
XML

my $re = qr/foo/; # don't include any capturing parens in this!

my @nodes = $doc->documentElement;
while (my $node = pop @nodes) {
    for my $c ($node->childNodes) {
        if ($c->nodeType==XML_ELEMENT_NODE) { push @nodes, $c }
        elsif ($c->nodeType==XML_TEXT_NODE || $c->nodeType==XML_CDATA_SECTION_NODE) {
            my @parts = split /($re)/, $c->data;
            next unless @parts>1;
            my $d = $doc->createDocumentFragment;
            for my $i (0..$#parts) {
                if ($i%2) {  # regex match
                    my $e = $doc->createElement('x');
                    $e->appendText($parts[$i]);
                    $d->appendChild($e);
                }
                else {  # text around match
                    $d->appendText($parts[$i]);
                }
            }
            $node->replaceChild($d, $c);
        }
    }
}

print $doc->toString;

__END__
<?xml version="1.0"?>
<foo>hello, <x>foo</x>bar <bar x="foo">world! <x>foo</x></bar> fo<quz/>oo<x>foo</x></foo>
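
To illustrate the first point, here is a minimal sketch (not part of the script above) of what a plain substitution on the raw string does. The shortened input and the variable names `$xml` and `$naive` are only assumptions for the demonstration.

```perl
use warnings;
use strict;

# Shortened sample input, similar to the document used above.
my $xml = '<foo>hello, foobar <bar x="foo">world! foo</bar></foo>';

# Naive approach: substitute on the raw string, ignoring XML structure.
(my $naive = $xml) =~ s/foo/<x>foo<\/x>/g;

print "$naive\n";
# The element name "foo" and the attribute value x="foo" get rewritten
# as well, so the result is no longer well-formed XML.
```

The DOM-based approach only ever sees text nodes, so element names and attribute values are left alone.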

For 2, I would recommend that, just for the sake of explaining what you want to match and not match, you simplify the separators like !!!hairsp; into something shorter, like a single !, to make things easier for us to read. The pitfalls you describe are indeed complicated, and in the end you might need to use some existing data such as (just for example) this List of German abbreviations. Update: My node Building Regex Alternations Dynamically might be useful in that regard.
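
As a rough sketch of that idea, a regex alternation can be built from a list of known abbreviations. The list `@abbrevs` and the sample text below are made up for illustration; a real list would come from an external source and would also have to account for the separators.

```perl
use warnings;
use strict;

# Hypothetical sample list of abbreviations.
my @abbrevs = ('z.B.', 'z.', 'd.h.', 'u.a.');

# Longest first, so that "z.B." is tried before "z.";
# quotemeta escapes the dots so they match literally.
my $alt = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b } @abbrevs;

# Non-capturing group, so the regex stays safe to use in split /($re)/.
my $re = qr/(?:$alt)/;

my $text  = 'Obst, z.B. Birnen, d.h. Kernobst';
my @found = $text =~ /($re)/g;
print join(', ', @found), "\n";   # prints "z.B., d.h."
```

Sorting by length matters because alternations are tried left to right, so a shorter abbreviation listed first would shadow a longer one that starts the same way.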


Re^2: wrap abbreviations in XML element by LexPl (Beadle) on May 16, 2025 at 12:10 UTC
Many thanks for your detailed feedback, which I will consume bit by bit :)

In a short test with slight adaptations to your script:

my $doc = XML::LibXML->load_xml(string => <<'XML');
<foo>hello, foobar <bar x="foo">world! a.!!!emsp14;A.</bar> fo<quz/>ooa.!!!hairsp;A.</foo>
XML

my $re = qr/a\.(!!!emsp14;|!!!hairsp;)A\./;

my $e = $doc->createElement('abbrev');

I got the following output:

<?xml version="1.0"?>
<foo>hello, foobar <bar x="foo">world! <abbrev>a.!!!emsp14;A.</abbrev>!!!emsp14;</bar> fo<quz/>oo<abbrev>a.!!!hairsp;A.</abbrev>!!!hairsp;</foo>

As you will see, the separating whitespace is repeated after the "abbrev" element.

Why do I use "!!!emsp14;" for the entity `&emsp14;`? I want to prevent such entities from being resolved by an XML parser, and I want to manipulate my data independently of a DTD.
Re^3: wrap abbreviations in XML element by haukex (Archbishop) on May 16, 2025 at 12:48 UTC
> As you will see, the separating whitespace is repeated after the "abbrev" element.

Yes, that's because of how split behaves when the regular expression contains multiple capturing groups: the text captured by each group is returned as an additional field. That's why I wrote "`don't include any capturing parens in this!`". If you change the capturing group in your regular expression into a non-capturing group, it'll work as expected:

qr/a\.(?:!!!emsp14;|!!!hairsp;)A\./

See also perlretut.
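
The effect can be reproduced in isolation with a small sketch. Here a single "!" stands in for the longer separators, and the "~" alternative and the variable names are only assumptions for the demonstration.

```perl
use warnings;
use strict;

my $text = 'x a.!A. y';   # "!" stands in for the longer separator

# Extra capturing group: split returns its text as an additional field.
my @with    = split /(a\.(!|~)A\.)/, $text;

# Non-capturing group: only the whole match appears between the texts.
my @without = split /(a\.(?:!|~)A\.)/, $text;

print scalar(@with),    " fields: ", join('', map { "[$_]" } @with),    "\n";
print scalar(@without), " fields: ", join('', map { "[$_]" } @without), "\n";
# prints:
# 4 fields: [x ][a.!A.][!][ y]
# 3 fields: [x ][a.!A.][ y]
```

With the extra group, the separator lands at an even index in the list, which the loop in the script treats as "text around match", so it is appended again as plain text after the new element.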