in reply to wrap abbreviations in XML element
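You have two separate problems: (1) wrapping the matched text in a new XML element, and (2) writing a regex that matches the abbreviations.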
For 1, here is some code that will walk all text nodes in an XML document, match a regex against that text, and wrap the matching text in a new tag. (Update: You didn't specify what should happen if there's a node in the match, e.g. "Z.<br/>B.", so the code below currently doesn't handle that case.)
    use warnings;
    use strict;
    use XML::LibXML;

    my $doc = XML::LibXML->load_xml(string => <<'XML');
    <foo>hello, foobar <bar x="foo">world! foo</bar> fo<quz/>oofoo</foo>
    XML

    my $re = qr/foo/;  # don't include any capturing parens in this!

    my @nodes = $doc->documentElement;
    while (my $node = pop @nodes) {
        for my $c ($node->childNodes) {
            if ($c->nodeType==XML_ELEMENT_NODE) {
                push @nodes, $c
            }
            elsif ($c->nodeType==XML_TEXT_NODE || $c->nodeType==XML_CDATA_SECTION_NODE) {
                my @parts = split /($re)/, $c->data;
                next unless @parts>1;
                my $d = $doc->createDocumentFragment;
                for my $i (0..$#parts) {
                    if ($i%2) {  # regex match
                        my $e = $doc->createElement('x');
                        $e->appendText($parts[$i]);
                        $d->appendChild($e);
                    }
                    else {  # text around match
                        $d->appendText($parts[$i]);
                    }
                }
                $node->replaceChild($d, $c);
            }
        }
    }
    print $doc->toString;

    __END__

    <?xml version="1.0"?>
    <foo>hello, <x>foo</x>bar <bar x="foo">world! <x>foo</x></bar> fo<quz/>oo<x>foo</x></foo>
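The odd/even indexing in the code above relies on a documented property of `split`: when the pattern contains a capturing group, the captured separators are returned along with the fields, so matches land at odd indices and the surrounding text at even ones. Here is a small sketch of just that step, without any XML. The helper name `tag_parts` is made up for this illustration.

```perl
use warnings;
use strict;

# Hypothetical helper: label each piece returned by split as 'text' or 'match'.
# $re must not contain capturing parens of its own, because every extra
# capture group adds extra elements to the list and breaks the odd/even rule.
sub tag_parts {
    my ($re, $text) = @_;
    my @parts = split /($re)/, $text;
    return map { [ $_ % 2 ? 'match' : 'text', $parts[$_] ] } 0 .. $#parts;
}

# A match at the very start yields an empty leading field, so the
# parity still holds; a trailing empty field is dropped by split.
for my $p ( tag_parts(qr/foo/, 'foobar foo') ) {
    printf "%-5s [%s]\n", @$p;
}
# prints:
# text  []
# match [foo]
# text  [bar ]
# match [foo]
```

This is also why the comment on `$re` warns against capturing parens: with `qr/(f)oo/`, each match would contribute two elements instead of one.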
For 2, I would recommend that, just for the sake of explaining what you do and don't want to match, you simplify the separators like !!!hairsp; into something shorter, like a single !, to make things easier for us to read. The pitfalls you describe are indeed complicated, and in the end you might need to use some existing data, like (just for example) this List of German abbreviations. Update: My node Building Regex Alternations Dynamically might be useful in that regard.
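If you do end up working from a list, one common way to turn it into a single regex is an alternation with the longest strings first and all metacharacters escaped. The following is only a sketch under assumptions: the abbreviation list is invented sample data using ! as the simplified separator, and `build_alternation` is a name chosen for this example.

```perl
use warnings;
use strict;

# Invented sample data; a real list would come from an existing source.
my @abbrevs = ('z.!B.', 'u.!a.', 'd.!h.', 'z.!T.', 'usw.');

# Build one regex from a list of literal strings.
# - quotemeta escapes "." and other metacharacters so they match literally
# - sorting longest-first keeps a short entry from shadowing a longer one
# - the group is non-capturing, so the result is safe to use in split /($re)/
sub build_alternation {
    my @strings = @_;
    my $alt = join '|',
        map  { quotemeta }
        sort { length($b) <=> length($a) or $a cmp $b } @strings;
    return qr/(?:$alt)/;
}

my $re    = build_alternation(@abbrevs);
my $text  = 'Obst, z.!B. Birnen, usw.';
my @found = $text =~ /($re)/g;
print join(', ', @found), "\n";   # prints: z.!B., usw.
```

This does nothing about the context-dependent pitfalls (for example an abbreviation that is also the end of a sentence); it only shows how a list can be turned into the `$re` used for matching.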
Re^2: wrap abbreviations in XML element
by LexPl (Beadle) on May 16, 2025 at 12:10 UTC
by haukex (Archbishop) on May 16, 2025 at 12:48 UTC