You have two separate problems:

  1. Do not use regular expressions to parse and manipulate XML or HTML, ever. See: Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks
  2. To help with your regex, we need many more examples of strings that should match and strings that shouldn't. See e.g.: Re: How to ask better questions using Test::More and sample data
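
To illustrate the kind of sample data meant in point 2, here is a minimal sketch of a Test::More script. The pattern and all the sample strings are my own placeholders (assumptions, not taken from your post); the idea is that you replace them with your real cases so we can see exactly what should and shouldn't match.

```perl
use warnings;
use strict;
use Test::More;

# Placeholder pattern, NOT a proposed solution: it only knows "z.B." and "d.h."
my $re = qr/\b(?:z\.\s?B\.|d\.\s?h\.)/;

# Hypothetical sample data; replace with real strings from your documents.
my @should_match     = ('z.B.', 'z. B.', 'Obst, d.h. Birnen');
my @should_not_match = ('zum Beispiel', 'Satz. Beginn', 'z.Bsp');

like   $_, $re, "matches: $_"        for @should_match;
unlike $_, $re, "does not match: $_" for @should_not_match;

done_testing;
```

Every string you add to either list becomes one test, so a failing case immediately shows which input the regex gets wrong.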

For 1, here is some code that will walk all text nodes in an XML document, match a regex against that text, and wrap the matching text in a new tag. (Update: You didn't specify what should happen if there's a node in the match, e.g. "Z.<br/>B.", so the code below currently doesn't handle that case.)

    use warnings;
    use strict;
    use XML::LibXML;

    my $doc = XML::LibXML->load_xml(string => <<'XML');
    <foo>hello, foobar <bar x="foo">world! foo</bar> fo<quz/>oofoo</foo>
    XML

    my $re = qr/foo/;  # don't include any capturing parens in this!

    my @nodes = $doc->documentElement;
    while (my $node = pop @nodes) {
        for my $c ($node->childNodes) {
            if ($c->nodeType==XML_ELEMENT_NODE) {
                push @nodes, $c;
            }
            elsif ($c->nodeType==XML_TEXT_NODE
                    || $c->nodeType==XML_CDATA_SECTION_NODE) {
                my @parts = split /($re)/, $c->data;
                next unless @parts>1;
                my $d = $doc->createDocumentFragment;
                for my $i (0..$#parts) {
                    if ($i%2) {  # regex match
                        my $e = $doc->createElement('x');
                        $e->appendText($parts[$i]);
                        $d->appendChild($e);
                    }
                    else {  # text around match
                        $d->appendText($parts[$i]);
                    }
                }
                $node->replaceChild($d, $c);
            }
        }
    }
    print $doc->toString;

    __END__

    <?xml version="1.0"?>
    <foo>hello, <x>foo</x>bar <bar x="foo">world! <x>foo</x></bar> fo<quz/>oo<x>foo</x></foo>

For 2, I would recommend that, just for the sake of explaining what you do and don't want to match, you simplify the separators like !!!hairsp; into something shorter, like a single !, to make things easier for us to read. The pitfalls you describe are indeed complicated, and in the end you might need to use some existing data like (just for example) this List of German abbreviations. Update: My node Building Regex Alternations Dynamically might be useful in that regard.
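
As a rough sketch of how a regex could be built from such a list: the abbreviations and the sample text below are made up for illustration, and this is only one common way of doing it. Sorting longest first keeps a short entry like "u.a." from matching ahead of "u.a.m.", and quotemeta makes the dots literal.

```perl
use warnings;
use strict;

# Hypothetical sample list; a real one would come from an external data source.
my @abbrevs = ('z.B.', 'd.h.', 'u.a.', 'u.a.m.');

# Longest first so that 'u.a.m.' wins over 'u.a.'; quotemeta escapes the dots.
my $alt = join '|',
    map  { quotemeta($_) }
    sort { length $b <=> length $a or $a cmp $b } @abbrevs;
my $re = qr/(?:$alt)/;

my $text  = 'Obst, z.B. Birnen u.a.m., d.h. vieles';
my @found = $text =~ /($re)/g;
print join(', ', @found), "\n";  # prints "z.B., u.a.m., d.h."
```

Note that the capturing parentheses are added only at the point of use, so $re itself has none and could be dropped into the split in the XML code above.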


In reply to Re: wrap abbreviations in XML element by haukex
in thread wrap abbreviations in XML element by LexPl
