in reply to Re: wrap abbreviations in XML element
in thread wrap abbreviations in XML element

Many thanks for your detailed feedback, which I will work through bit by bit :)

In a short test with slight adaptations to your script:

  1. my $doc = XML::LibXML->load_xml(string => <<'XML');
     <foo>hello, foobar <bar x="foo">world! a.!!!emsp14;A.</bar> fo<quz/>ooa.!!!hairsp;A.</foo>
     XML
  2. my $re = qr/a\.(!!!emsp14;|!!!hairsp;)A\./;
  3. my $e = $doc->createElement('abbrev');

I got the following output:

<?xml version="1.0"?>
<foo>hello, foobar <bar x="foo">world! <abbrev>a.!!!emsp14;A.</abbrev>!!!emsp14;</bar> fo<quz/>oo<abbrev>a.!!!hairsp;A.</abbrev>!!!hairsp;</foo>

As you will see, the separating whitespace is repeated after the "abbrev" element. Why do I use "!!!emsp14;" for the entity &emsp14;? I want to prevent such entities from being resolved by an XML parser, and I want to manipulate my data independently of a DTD.

Re^3: wrap abbreviations in XML element
by haukex (Archbishop) on May 16, 2025 at 12:48 UTC
    As you will see, the separating whitespace is repeated after the "abbrev" element.

    Yes, that's because of how split behaves when the regular expression contains multiple capturing groups: for every separator it matches, each capturing group contributes one extra field to the returned list. That's why I wrote "don't include any capturing parens in this!". If you change the capturing group in your regular expression into a non-capturing group, it'll work as expected: qr/a\.(?:!!!emsp14;|!!!hairsp;)A\./ See also perlretut.
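
    To make the difference visible, here is a minimal sketch. It assumes the script wraps the pattern in one capturing group of its own and splits the text on it (split /($re)/); the sample string and the variable names are made up for illustration and are not taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical sample text containing one abbreviation.
my $text = 'world! a.!!!emsp14;A. end';

# The adapted regex with an inner capturing group, and the
# suggested variant with a non-capturing group.
my $capturing     = qr/a\.(!!!emsp14;|!!!hairsp;)A\./;
my $non_capturing = qr/a\.(?:!!!emsp14;|!!!hairsp;)A\./;

# Assumption: the pattern is wrapped in one outer capturing group
# so that split also returns the matched separator itself.
my @with_inner    = split /($capturing)/,     $text;
my @without_inner = split /($non_capturing)/, $text;

# The inner capturing group adds one more field per match:
# the separating placeholder shows up a second time.
print scalar(@with_inner),    " fields: ", join('|', @with_inner),    "\n";
print scalar(@without_inner), " fields: ", join('|', @without_inner), "\n";
```

    With the inner capturing group, split returns four fields and the third one is the bare "!!!emsp14;", which is the text that ends up repeated after the "abbrev" element. With the non-capturing group there are only three fields: the text before, the abbreviation, and the text after.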