This opening part makes sense:
This was what the person wanted:
<tag#1>BLA</tag#1><tag#2>BLA</tag#1><tag#3>BLA</tag#1>
to turn into:
<tag#1>BLA</tag#1><tag#2>BLA</tag#2><tag#3>BLA</tag#3>
This is a matter of taking a badly formed html stream and
making it well formed. This is sensible, and easily done
in cases like your initial example, where
there are no nested tags involved in the bad forms.
(Your first attempt simply stopped after doing the first tag
in the stream, and used a while loop for no purpose). The
following would work over a series of non-nested tags:
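s{(<([^\s/>]+)[^>]*>[^<]*?)</[^>]+>}{$1</$2>}g
update: I'm using >[^<]*?
instead of >.*? so that it won't corrupt
streams that include properly nested tags. For the same reason the
tag patterns use [^>]* rather than .*? (a dot can match ">" and let
one match swallow two tags), and only the tag name is captured, so
<font size=2> is closed with </font>.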
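Here is a quick check of that one-line substitution on the sample stream. The wrapper name fix_flat_tags is made up for illustration. The character classes are written as [^>]* and [^\s/>]+, which is my own tightening so that a match cannot run across a tag boundary. Treat it as a sketch for flat, non-nested input only.

```perl
use strict;
use warnings;

# Hypothetical helper: make each close-tag match the open-tag just before it.
# Only meant for flat streams of the form <x>text</y><x>text</y>...
sub fix_flat_tags {
    my ($html) = @_;
    # $1 = open-tag plus its text, $2 = tag name only (attributes excluded)
    $html =~ s{(<([^\s/>]+)[^>]*>[^<]*?)</[^>]+>}{$1</$2>}g;
    return $html;
}

print fix_flat_tags('<tag#1>BLA</tag#1><tag#2>BLA</tag#1><tag#3>BLA</tag#1>'), "\n";
print fix_flat_tags('<font size=2>BLA</size>'), "\n";
```

A properly nested stream such as <a><b>x</b></a> should pass through unchanged, because the pattern only matches an open-tag that is followed directly by text and a close-tag.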
But working across nested tags would take more code and more
care. You'd need to work
through the stream tag by tag, pushing each open-tag name
onto a stack, and popping the last name off the stack each
time you hit a close-tag, to make sure the output was well
formed (though it might still have other problems, depending
on how bad the input was).
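The stack approach just described could be sketched roughly like this. The name fix_close_tags and the short list of void tags are my own assumptions. The sketch also assumes the stream has no comments, CDATA sections, or attribute values containing ">", so it is a starting point rather than a real HTML repair tool.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite every close-tag so that it closes the most
# recently opened tag, whatever name the input actually used.
sub fix_close_tags {
    my ($html) = @_;
    my @open;                                   # stack of open-tag names
    my %void = map { $_ => 1 } qw(br hr img input meta link);  # assumed list
    my $out  = '';

    # The capturing group makes split return the tags as well as the text.
    for my $piece (split /(<[^>]*>)/, $html) {
        if ($piece =~ m{^</}) {                 # close-tag: pop the stack
            my $name = pop @open;
            $out .= "</$name>" if defined $name;    # stray close-tags are dropped
        }
        elsif ($piece =~ m{^<([^\s/>!?]+)[^>]*>\z}) {   # open-tag: push its name
            my $name = $1;
            push @open, $name
                unless $void{lc $name} || $piece =~ m{/>\z};
            $out .= $piece;
        }
        else {                                  # text, comment, doctype
            $out .= $piece;
        }
    }

    # Close whatever is still open, innermost first.
    $out .= "</$_>" for reverse @open;
    return $out;
}

print fix_close_tags('<b><i>BLA</b></b><u>BLA</b>'), "\n";
```

Note that this trusts the open-tags completely and ignores the names in the close-tags, which is the right call only when the close-tags are the unreliable part, as in your example.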
But other stuff in your post makes little or no sense:
The person wanted the finishing tag to be the first
paramater in the html tag. So if i had < font size=2 >
I would have to end it with < /size > and not
< /font >. My first instinct was to do a while loop...
My first instinct would be to say "No, you don't really want
that. You're asking to have ill-formed html as the output.
What makes you think you want that?"
Then, looking at your last example, I think I understood the
idea; you don't want well-formed html as output. You want
a form where a person reading the stream can figure out more
easily what the scope is for a given tag in a densely nested
html structure. Is that it?
If so, there are better ways to do this than corrupting
the html tags in the odd way your friend suggested. What
if the name of the first attribute is the least useful thing
to show the reader? And why have a "human-readable" form that
can't be used reliably as input to a browser?
For instance, one thing that can aid human readability of html
is to simply place the tags and the text content on separate
lines; something like this:
s/>\s*</>\n</g; # normalize whitespace between adjacent tags
s/([^\n])</$1\n</g; # make sure every tag begins a new line
s/>([^\n])/>\n$1/g; # make sure every tag is followed by newline
More code and more care could be used to good effect,
e.g. to indent the tag lines to reflect nesting depth, to
eliminate newlines from within long open-tags, etc.
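As a rough sketch of the indenting idea: the helper below (indent_html is a made-up name, and the two-space default pad is an arbitrary choice) puts each tag or text run on its own line and indents it by nesting depth. It assumes well-formed input with no unclosed tags like <br>, and it collapses runs of whitespace everywhere, so it would not be safe for <pre> content.

```perl
use strict;
use warnings;

# Hypothetical helper: one tag or text run per line, indented by nesting depth.
sub indent_html {
    my ($html, $pad) = @_;
    $pad = '  ' unless defined $pad;
    my $depth = 0;
    my @lines;

    for my $piece (split /(<[^>]*>)/, $html) {
        $piece =~ s/\s+/ /g;        # collapse whitespace, including newlines
                                    # inside long open-tags
        $piece =~ s/^ | $//g;       # trim the ends
        next if $piece eq '';

        if ($piece =~ m{^</}) {     # close-tag: step back out before printing
            $depth-- if $depth > 0;
            push @lines, ($pad x $depth) . $piece;
        }
        elsif ($piece =~ m{^<[^!?]} && $piece !~ m{/>\z}) {   # open-tag
            push @lines, ($pad x $depth) . $piece;
            $depth++;
        }
        else {                      # text, comment, or self-closed tag
            push @lines, ($pad x $depth) . $piece;
        }
    }
    return join("\n", @lines) . "\n";
}

print indent_html(qq{<table><tr><td\n  align="left">BLA</td></tr></table>});
```

The output is still usable as browser input, since the tags themselves are left alone and only the whitespace around them changes.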