rsriram has asked for the wisdom of the Perl Monks concerning the following question:

All,

I am writing a conversion program in which I want to convert every <s1> to <section1>, except where it appears within <sb>...</sb>. The text I need to convert looks like this:

<s1>King of England
JAMES, by the Grace of God, King of England, Scotland, France and Ireland, Defender of the Faith...
...
This first selection was written in England to establish the colony of Virginia
<sb><s1>Rights of Landowners
We, greatly commending, and graciously accepting of, their Desires for the Furtherance</sb>

The first <s1> should be converted to <section1>; this is not a problem. But the second <s1> appears inside <sb>...</sb>, so it needs to be tagged as <h1>.

I used the following code to do this.

$_ =~ s/<s1>/<section1>/g;
if ($_ =~ /<sb>/) {
    while($_ =~ /<\/sb>/) {
        $_ =~ s/<s1>/<h1>/g;
    }
}

There is something incorrect in my usage of the while statement. Can someone help me with this?

Replies are listed 'Best First'.
Re: Help in regex
by GrandFather (Saint) on Feb 07, 2007 at 07:24 UTC

    This looks very much like an HTML parsing issue and the conventional answer is "use an appropriate module". In this case I would recommend HTML::TreeBuilder. Consider the following:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $str = <<DOC;
    <s1>King of England
    JAMES, by the Grace of God, King of England, Scotland, France and Ireland, Defender of the Faith...
    ...
    This first selection was written in England to establish the colony of Virginia
    <sb><s1>Rights of Landowners
    We, greatly commending, and graciously accepting of, their Desires for the Furtherance</sb>
    DOC

    my $root = HTML::TreeBuilder->new ();

    $root->ignore_unknown (0);
    $root->parse ($str);
    $root->eof ();

    my @s1Nodes = $root->look_down ('_tag', 's1');

    for my $node (@s1Nodes) {
        if ($node->look_up ('_tag', 'sb')) {
            # nested in a sb element - convert to h1
            $node->{_tag} = 'h1';
        } else {
            $node->{_tag} = 'section1';
        }
    }

    print $root->as_HTML ();

    Prints:

    <html><head></head><body></body><section1>King of England JAMES, by the Grace of God, King of England, Scotland, France and Ireland, Defender of the Faith... ... This first selection was written in England to establish the colony of Virginia <sb><h1>Rights of Landowners We, greatly commending, and graciously accepting of, their Desires for the Furtherance</h1></sb> </section1></html>

    DWIM is Perl's answer to Gödel
Re: Help in regex
by shmem (Chancellor) on Feb 07, 2007 at 08:36 UTC

    I'd apply two s/// conditionally, assuming that the words going into the section or h1 tags are all on one line:

    while(<DATA>) {
        s!(?<=<sb>)<s(\d+)>(.*)!<h$1>$2</h$1>!
            or s!<s(\d+)>(.*)!<section$1>$2</section$1>!;
        print
    }
    __DATA__
    <s1>King of England
    JAMES, by the Grace of God, King of England, Scotland, France and Ireland, Defender of the Faith...
    ...
    This first selection was written in England to establish the colony of Virginia
    <sb><s1>Rights of Landowners
    We, greatly commending, and graciously accepting of, their Desires for the Furtherance</sb>

    outputs

    <section1>King of England</section1>
    JAMES, by the Grace of God, King of England, Scotland, France and Ireland, Defender of the Faith...
    ...
    This first selection was written in England to establish the colony of Virginia
    <sb><h1>Rights of Landowners</h1>
    We, greatly commending, and graciously accepting of, their Desires for the Furtherance</sb>

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Help in regex
by kyle (Abbot) on Feb 07, 2007 at 11:57 UTC
    The range operator (..) might do what you're looking for. For details, see Range Operators in perlop.

    while (<DATA>) {
        if ( m{<sb>} .. m{</sb>} ) {
            s/<s1>/<h1>/g;
        }
        else {
            s/<s1>/<section1>/g;
        }
        print;
    }
    __DATA__
    <s1>King of England
    JAMES, by the Grace of God, King of England, Scotland, France and Ireland, Defender of the Faith...
    ...
    This first selection was written in England to establish the colony of Virginia
    <sb><s1>Rights of Landowners
    We, greatly commending, and graciously accepting of, their Desires for the Furtherance</sb>
    <s1>King of Perl
    This is only a test.
    <s1>King of Operators<sb><s1>Well, maybe not</sb>

    Output:

    <section1>King of England
    JAMES, by the Grace of God, King of England, Scotland, France and Ireland, Defender of the Faith...
    ...
    This first selection was written in England to establish the colony of Virginia
    <sb><h1>Rights of Landowners
    We, greatly commending, and graciously accepting of, their Desires for the Furtherance</sb>
    <section1>King of Perl
    This is only a test.
    <h1>King of Operators<sb><h1>Well, maybe not</sb>

    My only concern is what happens if <sb> and </sb> appear on a line together or on a line with <s1> outside of them (see the last line of my example). Your sample doesn't have that case present, but your real text might. If so, you'll have to parse the text more deeply or preprocess it to remove the "problem" areas.
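
One possible way to handle the same-line case described above is to scan the whole text in a single substitution and keep a flag that records whether the scan is currently inside <sb>...</sb>. This is only a sketch, not code from the thread: the helper name convert_s1 and the variable $in_sb are made up for illustration, and it assumes that <sb> elements are never nested and that every <sb> is eventually closed.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): rewrites every <s1> in one
# left-to-right pass, tracking whether we are between <sb> and </sb>.
# Assumes <sb> elements are not nested.
sub convert_s1 {
    my ($text) = @_;
    my $in_sb = 0;    # true while between <sb> and </sb>

    $text =~ s{(<sb>|</sb>|<s1>)}{
        my $tag = $1;
        if    ($tag eq '<sb>')  { $in_sb = 1; }    # entering an sb element
        elsif ($tag eq '</sb>') { $in_sb = 0; }    # leaving it
        else                    { $tag = $in_sb ? '<h1>' : '<section1>'; }
        $tag;    # value of the /e block becomes the replacement text
    }gex;

    return $text;
}

# The "problem" line from the example data above.
print convert_s1("<s1>King of Operators<sb><s1>Well, maybe not</sb>\n");
```

Because the flag lives outside the regex, this works the same whether the input is passed line by line (as long as no <sb> element spans two calls) or slurped as one string. It is still plain text substitution, so markup such as comments or attributes on the tags would need a real parser.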