corfuitl has asked for the wisdom of the Perl Monks concerning the following question:
Hi
I am trying to implement a function that splits a sentence containing XML tags into three parts. To be clearer, I want to parse each sentence so that the wrapping tags are split off from the text they surround. The input sentences are:
<d id="43">Text </d> here <a id="33"/>
<b id="33"/> Text <d id="43">text</d> here
<d id="43">text here</d>
<d id="43">text here</d> <d id="44">text here</d>
The output should be:
start: "", middle: "<d id="43">Text </d> here", end: " <a id="33"/>"
start: "<b id="33"/> ", middle: "Text <d id="43">text</d> here", end: ""
start: "<d id="43">", middle: "text here", end: "</d>"
start: "", middle: "<d id="43">text here</d> <d id="44">text here</d>", end: ""
I have started on the code below, but I don't think it is efficient. Any suggestions? The a, b and c tags are self-closing, while d always comes as an opening/closing pair.
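my $segment = $_;
my $start   = "";
my $end     = "";
my $middle  = "";

# Move leading self-closing tags and whitespace into $start.
# [abc] is a character class; [a|b|c] would also accept a literal "|".
while ($segment =~ /^(<[abc] id="[^"]*"\/>)/ || $segment =~ /^(\s+)/) {
    $start .= $1;
    $segment =~ s/^\Q$1\E//;
}

# Move trailing whitespace and self-closing tags into $end.
# [^"]* keeps the match inside a single tag; .*? could run across several.
while ($segment =~ /(\s+)$/ || $segment =~ /(<[abc] id="[^"]*"\/>)$/) {
    $end = "$1$end";
    $segment =~ s/\Q$1\E$//;
}

while ($segment =~ /^(<d id="[^"]*">).*?(<\/d>)/) {
    ----
}

print "start: \"$start\", middle: \"$middle\", end: \"$end\"\n";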
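For illustration, here is a minimal regex-only sketch of one way the unfinished part could be completed. It is not taken from any reply in this thread. It assumes that tags look exactly like the samples (a single id attribute in double quotes, no comments or CDATA), that the input has already been chomped, and that only one wrapping d pair is peeled off. The helper name split_segment and the variables $self_closing and $edge are made up for this example. A proper XML parser would be more robust for anything beyond input of this shape.

```perl
use strict;
use warnings;

# Self-closing a/b/c tags with an id attribute, as in the sample input (assumption).
my $self_closing = qr{<[abc]\s+id="[^"]*"\s*/>};
# A run of whitespace and/or self-closing tags.
my $edge = qr{(?:\s+|$self_closing)+};

# Hypothetical helper: returns (start, middle, end) for one sentence.
sub split_segment {
    my ($segment) = @_;
    my ($start, $end) = ('', '');

    # Peel leading and trailing whitespace / self-closing tags.
    $start = $1 if $segment =~ s{\A($edge)}{};
    $end   = $1 if $segment =~ s{($edge)\z}{};

    # If what is left starts with <d ...> and ends with </d>, check that
    # the two actually belong together by tracking the nesting depth.
    if ($segment =~ m{\A(<d\s+id="[^"]*">)(.*)(</d>)\z}s) {
        my ($open, $inner, $close) = ($1, $2, $3);
        my ($depth, $paired) = (0, 1);
        while ($inner =~ m{(<d\s+id="[^"]*">)|</d>}g) {
            $depth += defined($1) ? 1 : -1;
            if ($depth < 0) { $paired = 0; last }   # the first <d> closed early
        }
        if ($paired && $depth == 0) {
            $start  .= $open;
            $end     = $close . $end;
            $segment = $inner;
        }
    }
    return ($start, $segment, $end);
}

my @sentences = (
    '<d id="43">Text </d> here <a id="33"/>',
    '<b id="33"/> Text <d id="43">text</d> here',
    '<d id="43">text here</d>',
    '<d id="43">text here</d> <d id="44">text here</d>',
);

printf qq{start: "%s", middle: "%s", end: "%s"\n}, split_segment($_)
    for @sentences;
```

The depth check is what keeps the last sample intact: it begins with an opening d tag and ends with a closing one, but the first d closes before the end, so nothing is moved out of the middle.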
Replies are listed 'Best First'.

Re: Getting start and end xml tags
by Corion (Patriarch) on May 28, 2020 at 12:41 UTC
by corfuitl (Sexton) on May 28, 2020 at 13:00 UTC
by Corion (Patriarch) on May 28, 2020 at 13:05 UTC
by perlfan (Parson) on May 28, 2020 at 13:35 UTC
by haukex (Archbishop) on May 28, 2020 at 13:42 UTC
by perlfan (Parson) on May 28, 2020 at 14:07 UTC

Re: Getting start and end xml tags
by marto (Cardinal) on May 28, 2020 at 12:45 UTC
by corfuitl (Sexton) on May 28, 2020 at 12:58 UTC
by marto (Cardinal) on May 28, 2020 at 13:14 UTC
by Fletch (Bishop) on May 28, 2020 at 13:20 UTC

Re: Getting start and end xml tags
by haukex (Archbishop) on May 28, 2020 at 14:20 UTC