in reply to Converting HTML tags into uppercase using Perl

A valid HTML tag starts with a < followed by the name of the tag. A / may follow the < to mark a closing tag, and whitespace can separate the tokens inside the tag.

The code below finds the tag names and converts them to upper case.

while (<>) { s/(<\s*\/?\s*)(\w+)/$1\U$2/g; print; }
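
To see what the substitution does, here is a small sketch that applies the same regex to a string instead of to STDIN. The sample markup is made up for illustration and is not from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample input, including the optional whitespace
# and closing-tag slash that the regex allows for.
my $html = '<html>< body ><p>Hello</ p></body></html>';

# Same substitution as above, applied to a copy of the string:
# $1 is the "<", optional "/" and any whitespace; $2 is the tag name.
(my $upper = $html) =~ s/(<\s*\/?\s*)(\w+)/$1\U$2/g;

print "$upper\n";   # <HTML>< BODY ><P>Hello</ P></BODY></HTML>
```

The \U escape upper-cases everything after it in the replacement, which here is only the tag name captured in $2.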

Re^2: Converting HTML tags into uppercase using Perl
by davorg (Chancellor) on Nov 29, 2005 at 11:32 UTC

    See, this is why you should never try to parse arbitrary HTML with regular expressions. Your regex doesn't handle a number of very common occurrences. The first thing that springs to mind is tags with attributes: the tag name will be upper-cased, but the attribute names will be left untouched. The original poster was unclear as to what should be done in those circumstances.
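
    A quick sketch of the attribute case, using the regex from the parent post on a made-up tag (the sample string is only an illustration):

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample tag with two attributes.
    my $tag = '<a href="index.html" class="nav">home</a>';

    # The regex from the parent post, applied to a copy of the string.
    (my $result = $tag) =~ s/(<\s*\/?\s*)(\w+)/$1\U$2/g;

    print "$result\n";   # <A href="index.html" class="nav">home</A>
    ```

    Only the word directly after the < is captured, so href and class stay in lower case.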

    Also, can you be sure that every < character in the document starts a tag? What if it were in a CDATA section?

    All in all, I think it's far better to use an HTML parser. They are there to be used, so why not use them?
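
    As a rough sketch of the parser route, something like the following could work. It assumes the CPAN module HTML::Parser is installed (it is only one of several parsers you could pick), and both the upcase_tag helper and the sample input are hypothetical names made up for this example.

    ```perl
    use strict;
    use warnings;
    use HTML::Parser;   # CPAN module, assumed to be installed

    my $out = '';

    # Hypothetical helper: upper-case only the name at the front of a
    # token that the parser has already identified as a start or end tag.
    sub upcase_tag {
        my ($text) = @_;
        $text =~ s/^(<\/?)([^\s\/>]+)/$1\U$2/;
        $out .= $text;
    }

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ \&upcase_tag, 'text' ],
        end_h       => [ \&upcase_tag, 'text' ],
        # Everything else (text, comments, declarations) passes through.
        default_h   => [ sub { $out .= shift }, 'text' ],
    );

    # Hypothetical sample input with an attribute and a script block.
    my $html = '<p class="x">a < b</p><script>if (start < end) {}</script>';
    $p->parse($html);
    $p->eof;

    print "$out\n";
    ```

    The point of the design is that the parser decides what is a tag. The substitution only ever sees text the parser reported as a start or end tag, so a stray < in ordinary text or inside a script element should pass through unchanged.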

    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      I figured that this was a homework question anyway, so a reasonable bit of explanation would let the student get by despite the numerous variations that exist in real HTML. The OP wants to uppercase his tags. He does not mention attributes, so I have left those for him to look at.

      The HTML 4 DTD does not define a CDATA section as an HTML tag, but it does define the <script> tag, whose content could contain conditional statements (e.g. start < end) that are matched by the regex. Tackling these issues is also something for the guy to look at.
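
      For the record, a small sketch of that false match, using a made-up script block as the sample input:

      ```perl
      use strict;
      use warnings;

      # Hypothetical script element containing a less-than comparison.
      my $js = '<script>if (start < end) { run(); }</script>';

      # The regex from the original reply, applied to a copy of the string.
      (my $result = $js) =~ s/(<\s*\/?\s*)(\w+)/$1\U$2/g;

      print "$result\n";   # <SCRIPT>if (start < END) { run(); }</SCRIPT>
      ```

      The comparison "< end" looks like a tag to the regex, so the variable name is upper-cased along with the real tag names.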