Re: Split tags and words nicely

You can split on the boundaries where tags start or end by using look-behind and look-ahead assertions. That is, look for where a tag stops and text starts or where text stops and a tag starts. Because the assertions are zero-width, split consumes no characters and the tags themselves are kept intact. This script runs with the -l flag to save having to print newlines explicitly.

#!/usr/local/bin/perl -l
#
use strict;
use warnings;

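my $rxSplit = qr
   {(?x)
       (?<=[^<])   # after any character that is not '<' ...
       (?=[<])     # ... and just before a '<': a tag starts here
       |
       (?<=[>])    # after a '>' that closes a tag ...
       (?=[^<])    # ... and just before text rather than another tag
   };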

my $html =
   q{<tag ref=1>Start<tag ref=2>and more</tag>and end</tag>};

my @elems = split m{$rxSplit}, $html;

print for @elems;

And the output.

<tag ref=1>
Start
<tag ref=2>
and more
</tag>
and end
</tag>

I hope this is of use.

Cheers,

JohnGG


Re^2: Split tags and words nicely by ww (Archbishop) on Dec 28, 2006 at 18:56 UTC
I do indeed admire johngg's regex approach (and have ++ed it), but at the same time, hesitate to walk away without pointing out that it has NO capacity to flag mis-nesting (mis-nesting by .html or .xml standards, that is) and suspect that at some point bwgoudey's input data may have an anomaly or two.

Suppose the $html in johngg's Re: Split tags and words nicely were changed to:

   q{<tag ref=1><tag ref=1a>Start<tag ref=2>and </tag><tag "ref=3">more</tag>and end};

Note unbalanced opens (4) and closes (2). Leaving all else alone, output becomes:

<tag ref=1>
<tag ref=1a>
Start
<tag ref=2>
and 
</tag>
<tag "ref=3">
more
</tag>
and end

... which offers no ready hint or markup or warning that the tags were mis-nested. This is part of the reason that so many monks will advise against trying to parse the likes of .html or .xml with regexen and advocate the use of some of the modules mentioned above.
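To give a rough idea of what flagging mis-nesting involves, here is a minimal sketch that reuses the same splitting regex and tracks open tags on a stack. The helper name check_nesting and its messages are hypothetical, not something from the thread, and it assumes every tag is either a plain opening or a plain closing tag (no self-closing tags, comments, or '>' inside attribute values). It is an illustration of the bookkeeping only, not a substitute for a real parser.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Same boundary-splitting regex as in the post above.
my $rxSplit = qr{(?x)
    (?<=[^<]) (?=[<])
    |
    (?<=[>])  (?=[^<])
};

# Hypothetical helper: returns a list of complaints about tag nesting,
# or an empty list if every opening tag is closed in the right order.
sub check_nesting {
    my ($html) = @_;
    my (@open, @problems);

    for my $elem (split m{$rxSplit}, $html) {
        if ($elem =~ m{\A</\s*([^\s>]+)}) {
            # Closing tag: it must match the most recently opened tag.
            my $name = $1;
            if (@open && $open[-1] eq $name) {
                pop @open;
            }
            else {
                push @problems, "unexpected </$name>";
            }
        }
        elsif ($elem =~ m{\A<\s*([^\s/>]+)}) {
            # Opening tag: remember its name until it is closed.
            push @open, $1;
        }
        # Anything else is plain text and is ignored here.
    }

    # Whatever is still on the stack was never closed.
    push @problems, "unclosed <$_>" for reverse @open;
    return @problems;
}

my $bad =
    q{<tag ref=1><tag ref=1a>Start<tag ref=2>and </tag><tag "ref=3">more</tag>and end};

print "$_\n" for check_nesting($bad);
```

For the mis-nested string above this reports two unclosed tags, while the original $html from the first post produces no complaints. Real markup quickly outgrows this kind of check, which is the argument for a parser module.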
Re^3: Split tags and words nicely by johngg (Canon) on Dec 28, 2006 at 19:57 UTC
I agree completely with ww and reciprocate the ++. I am sure that a proper parser is by far the best approach for all but the very simplest and well behaved markup data. Unfortunately, I have done virtually nothing with HTML or XML as they haven't come my way in my current job. Because of that I can't post concrete examples of parser use, never having used one. I must rectify this.

Cheers,

JohnGG