in reply to Help in regex
This looks very much like an HTML parsing issue and the conventional answer is "use an appropriate module". In this case I would recommend HTML::TreeBuilder. Consider the following:
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $str = <<DOC;
    <s1>King of England
    JAMES, by the Grace of God, King of England, Scotland, France and Ireland,
    Defender of the Faith...
    ...
    This first selection was written in England to establish the colony of Virginia
    <sb><s1>Rights of Landowners
    We, greatly commending, and graciously accepting of, their Desires for the Furtherance</sb>
    DOC

    my $root = HTML::TreeBuilder->new ();

    $root->ignore_unknown (0);
    $root->parse ($str);
    $root->eof ();

    my @s1Nodes = $root->look_down ('_tag', 's1');

    for my $node (@s1Nodes) {
        if ($node->look_up ('_tag', 'sb')) {
            # nested in a sb element - convert to h1
            $node->{_tag} = 'h1';
        } else {
            $node->{_tag} = 'section1';
        }
    }

    print $root->as_HTML ();
Prints:
    <html><head></head><body></body><section1>King of England JAMES, by the Grace of God, King of England, Scotland, France and Ireland, Defender of the Faith... ... This first selection was written in England to establish the colony of Virginia <sb><h1>Rights of Landowners We, greatly commending, and graciously accepting of, their Desires for the Furtherance</h1></sb> </section1></html>
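If the same renaming is needed in more than one place, it could be wrapped in a small subroutine. The following is only a sketch under a couple of assumptions: `retag_s1` is a made-up helper name, and it uses the `tag` accessor from HTML::Element to rename elements rather than assigning to `$node->{_tag}` directly. The renaming rule itself is the same as in the code above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (the name is an assumption, not from the original post):
# parse a markup string, rename <s1> elements, and return the serialised tree.
sub retag_s1 {
    my ($markup) = @_;

    my $root = HTML::TreeBuilder->new ();

    $root->ignore_unknown (0);    # keep non-HTML tags such as s1 and sb
    $root->parse ($markup);
    $root->eof ();

    for my $node ($root->look_down ('_tag', 's1')) {
        # An s1 with an sb ancestor becomes h1, any other s1 becomes section1
        $node->tag ($node->look_up ('_tag', 'sb') ? 'h1' : 'section1');
    }

    my $html = $root->as_HTML ();

    $root->delete ();             # release the tree explicitly
    return $html;
}

print retag_s1 ('<s1>outer <sb><s1>inner</sb>');
```

Returning a string instead of printing inside the routine leaves it up to the caller whether the result goes to a file, to STDOUT, or into further processing.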