regex and html tags

Parham has asked for the wisdom of the Perl Monks concerning the following question:

Someone came to me with a PHP question, and I used Perl to solve it :P. This was what the person wanted:
<tag#1>BLA</tag#1><tag#2>BLA</tag#1><tag#3>BLA</tag#1>
to turn into:
<tag#1>BLA</tag#1><tag#2>BLA</tag#2><tag#3>BLA</tag#3>
meaning the first tag opened had to be the first tag closed. There was a trick to this question, though: the person wanted the closing tag to be named after the first parameter in the HTML tag. So if I had < font size=2 > I would have to end it with < /size > and not < /font >. My first instinct was to loop through the tags with a while loop, find what I needed, do a replace, and continue:

$text = "<tag#1>BLA</tag#2><tag#2>BLA</tag#2><tag#3>BLA</tag#3>";

$number = 1;

while ($text =~ s/<(.+?)#\d>(.+?)<\/(.+?)#\d>/<$1#$number>$2<\/$3#$number>/) {
    $number++;
    last;
}

print "$text\n";

I couldn't fully answer it, though: the substitution only worked for a single pass, thus the "last;" (I guess I didn't need to increment $number then :P). Anyway, that solution partially worked. So I tried again:

$text = "<FONT face=arial>this is <FONT SIZE=2>TWO</FONT>bla<FONT color=red>red</FONT>bla bla</FONT>";

$text =~ s#<(.+?\s(.+?)=.+?)>(.+?)<\/.+?>#<$1>$3<\/$2>#g;

print "$text\n";

which also partially worked. Although the problem is long past, the situation still creeps in my head. This is all old code, but the question has bothered me for a while. The only real reason I'm asking is because I want to gain some valuable experience from it. So I'm just wondering if anyone has a better solution than the two I provided above? Seeking the wisdom of the perlmonks :)

Edit: Added some <code> tags. larsen

Replies are listed 'Best First'.
Re: regex and html tags by graff (Chancellor) on Oct 21, 2002 at 03:07 UTC

This opening part makes sense:

This was what the person wanted:

  <tag#1>BLA</tag#1><tag#2>BLA</tag#1><tag#3>BLA</tag#1>

to turn into:
<tag#1>BLA</tag#1><tag#2>BLA</tag#2><tag#3>BLA</tag#3>

This is a matter of taking a badly formed html stream and making it well formed. This is sensible, and easily done in cases like your initial example, where there are no nested tags involved in the bad forms. (Your first attempt simply stopped after fixing the first tag in the stream, so the while loop served no purpose.) The following would work over a series of non-nested tags:

s{(<(\w+.*?)>[^<]*?)</.+?>}{$1</$2>}g

update: I'm using >[^<]*? instead of >.*? so that it won't corrupt streams that include properly nested tags.
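
To see what that substitution does, here is a small test run on the example from the question. The variable names are just for illustration. Note that $2 captures everything inside the open-tag, so a tag with attributes such as <font size=2> would get the whole string "font size=2" in its close-tag; this is only a sketch for attribute-free, non-nested tags.

```perl
use strict;
use warnings;

# Sample input from the question: every close-tag wrongly says tag#1.
my $text = '<tag#1>BLA</tag#1><tag#2>BLA</tag#1><tag#3>BLA</tag#1>';

# $1 = the open-tag plus the text that follows it,
# $2 = whatever was inside the open-tag's angle brackets.
(my $fixed = $text) =~ s{(<(\w+.*?)>[^<]*?)</.+?>}{$1</$2>}g;

print "$fixed\n";
```

Each close-tag is rebuilt from the open-tag that precedes it, so the original close-tag name is never consulted.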

But working across nested tags would take more code and more care. You'd need to work through the stream tag by tag, pushing each open-tag name onto a stack, and popping the last name off the stack each time you hit a close-tag, to make sure the output was well formed (though it might still have other problems, depending on how bad the input was).
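
A rough sketch of that stack idea might look like the following. The sub name fix_close_tags is made up for this example, and it assumes very simple markup: no comments, no empty elements like <br>, and no ">" inside attribute values. Stray close-tags are dropped and unclosed tags are closed at the end, which is one possible policy, not the only one.

```perl
use strict;
use warnings;

# Rewrite each close-tag so that it closes the most recently opened tag.
sub fix_close_tags {
    my ($html) = @_;
    my @stack;       # names of the tags that are currently open
    my $out = '';

    # Split into tags and text; the capturing group keeps the tags.
    for my $piece (split /(<[^>]*>)/, $html) {
        if ($piece =~ m{^</}) {
            # Close-tag: ignore its own name, use the top of the stack.
            my $name = pop @stack;
            $out .= "</$name>" if defined $name;   # drop stray close-tags
        }
        elsif ($piece =~ m{^<\s*([\w#]+)}) {
            # Open-tag: remember its name and pass it through unchanged.
            push @stack, $1;
            $out .= $piece;
        }
        else {
            $out .= $piece;                        # plain text
        }
    }

    # Close anything still open, innermost first.
    $out .= "</$_>" for reverse @stack;
    return $out;
}

my $nested = fix_close_tags('<b><i>BLA</b> more</b>');
print "$nested\n";
```

Because only the stack decides which name goes into a close-tag, the output nests properly even when the input close-tags are wrong, though it cannot tell whether the author meant a different nesting.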

But other stuff in your post makes little or no sense:

The person wanted the finishing tag to be the first parameter in the html tag. So if I had < font size=2 > I would have to end it with < /size > and not < /font >. My first instinct was to do a while loop...

My first instinct would be to say "No, you don't really want that. You're asking to have ill-formed html as the output. What makes you think you want that?"

Then, looking at your last example, I think I understood the idea; you don't want well-formed html as output. You want a form where a person reading the stream can figure out more easily what the scope is for a given tag in a densely nested html structure. Is that it?

If so, there are better ways to do this than corrupting the html tags in the odd way your friend suggested. What if the name of the first attribute is the least important information? Why have a "human-readable" form that can't be used reliably as input to a browser?

For instance, one thing that can aid human readability of html is to simply place the tags and the text content on separate lines; something like this:

 s/>\s*</>\n</g; # normalize whitespace between adjacent tags
 s/([^\n])</$1\n</g; # make sure every tag begins a new line
 s/>([^\n])/>\n$1/g; # make sure every tag is followed by newline
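
For illustration, here are the same three substitutions applied to a small sample string. The sample markup and the variable name are invented for this example.

```perl
use strict;
use warnings;

my $html = '<b>bold <i>both</i></b>';

for ($html) {
    s/>\s*</>\n</g;         # normalize whitespace between adjacent tags
    s/([^\n])</$1\n</g;     # make sure every tag begins a new line
    s/>([^\n])/>\n$1/g;     # make sure every tag is followed by newline
}

print "$html\n";
```

Every tag and every run of text ends up on its own line. Whitespace inside text runs is left alone, so the space after "bold" survives at the end of its line.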

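More code and more care could be used to good effect, e.g. to indent the tag lines to reflect nesting depth, to eliminate newlines from within long open-tags, etc.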
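
As one possible take on indenting by nesting depth, here is a minimal sketch. The sub name indent_lines is hypothetical, and it assumes the input already has one tag or one run of text per line, with properly nested tags and no empty elements like <br>.

```perl
use strict;
use warnings;

# Indent one-item-per-line markup by two spaces per nesting level.
sub indent_lines {
    my ($markup) = @_;
    my $depth = 0;
    my @out;

    for my $line (split /\n/, $markup) {
        # Close-tag: step out before printing it.
        $depth-- if $line =~ m{^</} && $depth > 0;
        push @out, ('  ' x $depth) . $line;
        # Open-tag: step in after printing it.
        $depth++ if $line =~ m{^<[^/!]};
    }
    return join "\n", @out;
}

my $indented = indent_lines("<b>\nbold\n<i>\nboth\n</i>\n</b>");
print "$indented\n";
```

The depth counter only makes sense on well-formed input, so this would be run after any repair of the close-tags, not before.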