Converting HTML tags into uppercase using Perl

steve_g50 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Converting HTML tags into uppercase using Perl by davorg (Chancellor) on Nov 29, 2005 at 11:04 UTC
It would be really simple to knock up something that did this using HTML::Parser, but it's perhaps worth pointing out that if you are at all interested in XHTML compatibility then valid XHTML tags are all lower case.

Update: Here's a basic HTML::Parser solution. It can almost certainly be improved and/or simplified.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

my $p = HTML::Parser->new(start_h => [\&start, 'tagname, attr, attrseq'],
                          end_h   => [\&end,   'tagname'],
                          text_h  => [\&text,  'text']);

$p->parse_file(shift);

sub start {
  my ($name, $attr, $attrseq) = @_;

  print '<' . uc($name);
  if (keys %$attr) {
    foreach (@$attrseq) {
      print ' ' . uc($_) . '="' . $attr->{$_} . '"';
    }
  }
  print '>';
}

sub end {
  print '</' . uc($_[0]) . '>';
}

sub text {
  print $_[0];
}
```

-- <http://dave.org.uk>
"The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg
Re^2: Converting HTML tags into uppercase using Perl by steve_g50 (Initiate) on Nov 30, 2005 at 11:10 UTC
I've tried this, but I can't get it to register the filename after I've entered it. Any ideas?

```perl
#!/usr/bin/perl
use warnings;
use HTML::Parser;

print("Enter an html file (with either a .html or .htm extension): ");
$file=<STDIN>;

my $file = $ARGV[0];
unless ($file) {
    print ("No filename given\n");
    exit;
}

my $new;
my $p = HTML::Parser->new(
    start_h   => [ \&start_h, 'tagname, text' ],
    end_h     => [ \&end_h, 'tagname, text' ],
    default_h => [ sub { $new .= shift }, 'text' ],
);
$p->parse_file($file);

# Rename the old file
my $newfile = $file.'.old';
rename($file, $newfile) or die "Can't rename $file: $!";

# Write the new text to the old filename
open my $fh, ">", $file or die "Can't create new file: $!";
print $fh $new;
close $fh;

sub start_h {
    my($tag, $text) = @_;
    my $uc = uc $tag;
    $text =~ s/$tag/$uc/;
    $new .= $text;
}

sub end_h {
    my($tag, $text) = @_;
    my $uc = uc $tag;
    $text =~ s/$tag/$uc/;
    $new .= $text;
}
```
Re^3: Converting HTML tags into uppercase using Perl by davorg (Chancellor) on Nov 30, 2005 at 11:41 UTC
```perl
$file=<STDIN>;
my $file = $ARGV[0];
```

This looks pretty confused to me.

You read the filename from STDIN into a package variable called `$file` (incidentally, you don't `chomp` that value so it still has a newline character on the end). You then ignore that value and create a new, lexical, variable also called `$file` and into that you copy the value of the first command line argument. You don't say how you call the program, but if you don't give it any command line arguments then that will be 'undef'. You then ignore the package variable (which has the correct value - albeit with an extra newline) and continue to use the lexical value which (probably) contains 'undef'.

So, no, it almost certainly won't do what you want :)

This is a good example of why you should always have `use strict` in your programs.

You probably want to write that code something like this (untested):

```perl
# check to see if you have a command line argument
my $file = $ARGV[0];

# if not, or if it's not an HTML file, then prompt for one
until ($file && ($file =~ /\.html?$/i)) {
  print('Enter an html file (with either a .html or .htm extension): ');
  $file=<STDIN>;
  chomp $file;
}
```
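For anyone who wants to try the prompt loop above without typing at a terminal, here is a minimal sketch that wraps it in a sub. The name `get_html_filename` and the input-handle argument are made up for illustration, and the end-of-input check is an addition that the loop above does not have (without it, `chomp` would be handed `undef` once input runs out).

```perl
use strict;
use warnings;

# Hypothetical helper: return $file if it already looks like an HTML
# filename, otherwise keep prompting on $in until one is entered.
sub get_html_filename {
    my ($file, $in) = @_;

    until (defined $file && $file =~ /\.html?$/i) {
        print 'Enter an html file (with either a .html or .htm extension): ';
        $file = <$in>;
        die "No filename given\n" unless defined $file;   # end of input
        chomp $file;
    }
    return $file;
}
```

In a script this would presumably be called as `my $file = get_html_filename($ARGV[0], \*STDIN);`, so a command line argument is used when present and the prompt only appears otherwise.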
Re^4: Converting HTML tags into uppercase using Perl by steve_g50 (Initiate) on Nov 30, 2005 at 13:20 UTC
Re: Converting HTML tags into uppercase using Perl by holli (Abbot) on Nov 29, 2005 at 11:18 UTC
The following uses HTML::TokeParser and should give you a starting point:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# new() returns undef if the file can't be opened
my $p = HTML::TokeParser->new( "file.html" )
    or die "Can't open file.html: $!";

while ( my $t = $p->get_token )
{
    #forward comments, text and declarations
    if ( $t->[0] =~ /[CDT]/ )
    {
        print $t->[1];
    }
    #uppercase start tags
    elsif ( $t->[0] =~ /S/ )
    {
        print "<", uc($t->[1]), " ", join (" ", map { uc($_) . '="' . $t->[2]->{$_} . '"' } @{$t->[3]}), ">";
    }
    #uppercase end tag
    elsif ( $t->[0] =~ /E/ )
    {
        print uc($t->[2]);
    }
    #forward processing instruction
    elsif ( $t->[0] =~ /PI/ )
    {
        print $t->[2];
    }
}
```

holli, /regexed monk/
Re: Converting HTML tags into uppercase using Perl by planetscape (Chancellor) on Nov 30, 2005 at 01:40 UTC
Also see HTML Tidy's `-upper` directive.

HTH, planetscape
OT Re: Converting HTML tags into uppercase using Perl by ww (Archbishop) on Nov 29, 2005 at 12:24 UTC
...and, while uppercase tags are allowed under html 4.01, they are NOT allowed in xhtml ~~xml~~ Slap ww upside the head!... so if this is other than homework, steve_g50 may wish to learn a bit more about .html as well as about perl.

Update: Grinder is, of course, correct both re xml and re need to provide cites, and Fletch, thanks! Your cite is bang_on. Moral (and message to self): ensure caffeine levels are within normal operating range and put brain in gear before typing.
Re: OT Re: Converting HTML tags into uppercase using Perl by grinder (Bishop) on Nov 29, 2005 at 12:43 UTC
> uppercase tags [...] are NOT allowed in xml

ww may wish to learn more about XML, or at least be able to quote the specification chapter and verse in order to back up such a claim. I've been doing XML for years (and SGML before that) and I've never heard of such nonsense.

A `start-tag` is a `Name`, and a `Name` is one or more `Letter`s (more or less, ignoring namespace issues), and a `Letter` may be drawn from many, many things, including, but not limited to, uppercase and lowercase letters. See the section on logical structures in the XML specification for more information.

Update: my bad, I did ponder how ww could have come up with such an outlandish idea (because his/her advice is spot-on in general), and I failed to make the connection to XHTML. I just wanted to quash the meme before it got any further.

• another intruder with the mooring in the heart of the Perl
Re^2: OT Re: Converting HTML tags into uppercase using Perl by Fletch (Bishop) on Nov 29, 2005 at 14:01 UTC
I think he misspoke and meant "XHTML" rather than XML. While you are correct that XML allows upper-, lower-, and mixed-case tag names, the XHTML spec does specifically require lowercase:

> 4.2. Element and attribute names must be in lower case
>
> XHTML documents must use lower case for all HTML element and attribute names. This difference is necessary because XML is case-sensitive e.g. <li> and <LI> are different tags.

http://www.w3.org/TR/xhtml1/#h-4.2
Re: Converting HTML tags into uppercase using Perl by mlh2003 (Scribe) on Nov 30, 2005 at 12:01 UTC
Please don't post on two separate forums. You will be darting between both and getting more confused about a suitable solution to your problem - particularly if both threads take different approaches. Admittedly there are common ideas in both, but you're best to stick with one and run with that.

_______
Code is untested unless explicitly stated
mlh2003
Re: Converting HTML tags into uppercase using Perl by Samy_rio (Vicar) on Nov 29, 2005 at 11:03 UTC
~~Hi steve_g50, Try this,~~

```perl
#!/usr/bin/perl -w
use strict;

local $/;
open(INPUT, "input.html") || die("$!");
open(OUTPUT, ">output.html") || die ("$!");
my $txt = <INPUT>;
$txt=~s/<([^> ]*)([^>]*>)/"<".uc($1)."$2"/egsi; #Updated: only element names except attributes
print OUTPUT $txt;
```

Regards, Velusamy R.

eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';
Re^2: Converting HTML tags into uppercase using Perl by Corion (Patriarch) on Nov 29, 2005 at 11:09 UTC
This script of course breaks for the case of an embedded "`>`" sign in the value, and it uppercases all values too, both of which will break the HTML file:

```html
<html>
<img src="a_greater_b.gif" alt="a > b" />
<img src="a_smaller_b.gif" alt="a < b"/>
</html>
```

These niggles are the reason why it is always recommended to avoid parsing HTML with regular expressions.

Update: Rearranged HTML to be a test case for the second problem as well.
Re^2: Converting HTML tags into uppercase using Perl by steve_g50 (Initiate) on Nov 29, 2005 at 13:16 UTC
Thanks Samy_rio, I'll see what I can do with your program. Thank you to everyone else for your help. Keep replying if you can still help with my problem. Merry Christmas to all of you who celebrate it.
Re^2: Converting HTML tags into uppercase using Perl by steve_g50 (Initiate) on Nov 29, 2005 at 11:53 UTC
This works very well, thanks. But how do I get the program to ASK for a .htm or .html file, then change the tags in the given file to uppercase, and THEN save the new file using the OPEN function? Thanks again.
Re^3: Converting HTML tags into uppercase using Perl by davorg (Chancellor) on Nov 29, 2005 at 12:17 UTC
It doesn't work very well, for all of the reasons that Corion listed. It will break badly on various (common) types of HTML. Please look at using a solution that uses a real HTML parser.
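As a rough illustration of the parser-based route for the "change the tags, then save with open" part of the question, here is a minimal, untested-in-anger sketch. It assumes HTML::Parser is installed; the sub names `uc_tags` and `uc_file` are invented for this example, and it only upper-cases element names, leaving attributes, text, comments and script contents exactly as they appear in the source.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Return a copy of $html with the element name in every start and end
# tag upper-cased. Everything else is passed through unchanged.
sub uc_tags {
    my ($html) = @_;
    my $out = '';

    # 'tagname' is the lower-cased element name, 'text' the original source.
    my $tag_h = sub {
        my ($tag, $text) = @_;
        $text =~ s/^(<\/?\s*)\Q$tag\E/$1\U$tag/i;
        $out .= $text;
    };

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $tag_h, 'tagname, text' ],
        end_h       => [ $tag_h, 'tagname, text' ],
        default_h   => [ sub { $out .= shift }, 'text' ],
    );
    $p->parse($html);
    $p->eof;
    return $out;
}

# Read $in_name, upper-case its tags and write the result to $out_name.
sub uc_file {
    my ($in_name, $out_name) = @_;

    open my $in, '<', $in_name or die "Can't read $in_name: $!";
    my $html = do { local $/; <$in> };    # slurp the whole file
    close $in;

    open my $out, '>', $out_name or die "Can't write $out_name: $!";
    print {$out} uc_tags($html);
    close $out or die "Can't close $out_name: $!";
}
```

Once a filename has been obtained from the command line or a prompt, a call such as `uc_file($file, "$file.new")` would write the converted copy next to the original; the `.new` suffix is just a placeholder.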
Re: Converting HTML tags into uppercase using Perl by kulls (Hermit) on Nov 29, 2005 at 13:46 UTC
Basically, why do you need this? May I know the details, so that the problem can be solved in a better way?

-kulls
Re: Converting HTML tags into uppercase using Perl by inman (Curate) on Nov 29, 2005 at 11:21 UTC
A valid HTML tag starts with a < followed by the name of the tag. A / character is also allowed following the < to indicate the closing tag. Whitespace can also be used in the tag to separate tokens. The code below finds the tag names and replaces them with their upper-case form.

```perl
while (<>) {
    s/(<\s*\/?\s*)(\w+)/$1\U$2/g;
    print;
}
```
Re^2: Converting HTML tags into uppercase using Perl by davorg (Chancellor) on Nov 29, 2005 at 11:32 UTC
See, this is why you should never try to parse arbitrary HTML with regular expressions. Your regex doesn't handle a number of very common occurrences.

The first thing that springs to mind is tags with attributes - the tag name will be upper-cased, but the attribute names will be left untouched. The original poster was unclear as to what should be done in those circumstances.

Also can you be sure that every < character in the document starts a tag? What if it was in a CDATA section?

All in all, I think it's far better to use an HTML parser. They are there to be used, so why not use them?
Re^3: Converting HTML tags into uppercase using Perl by inman (Curate) on Nov 29, 2005 at 11:59 UTC
I figured that this was a homework question anyway, so a reasonable bit of explanation would allow the student to get away with the numerous variations that exist in real HTML.

The OP wants to uppercase his tags. He does not mention attributes, so I have left that for him to look at. A CDATA section is not an HTML tag as defined by the HTML 4 DTD, but a <script> tag is, and it could contain conditional statements (e.g. start < end) that are matched by the regex. Tackling these issues is also something for the guy to look at.
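The two cases raised in this sub-thread (attribute names left alone, and a `<` inside a script being taken for a tag) are easy to reproduce with core Perl. This is only a sketch: it assumes the substitution is meant to allow optional whitespace (`\s*`) around the slash, and the wrapper name `uc_tags_regex` is made up for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The regex approach wrapped in a helper so it can be applied to strings.
sub uc_tags_regex {
    my ($html) = @_;
    $html =~ s/(<\s*\/?\s*)(\w+)/$1\U$2/g;
    return $html;
}

# Tag names are upper-cased, the attribute name is left untouched.
print uc_tags_regex('<a href="x.html">link</a>'), "\n";
# prints: <A href="x.html">link</A>

# The "<" in the comparison is mistaken for the start of a tag.
print uc_tags_regex('<script>if (start < end) { go(); }</script>'), "\n";
# prints: <SCRIPT>if (start < END) { go(); }</SCRIPT>
```

The second line of output shows the script body being altered (`end` becomes `END`), which in a case-sensitive language such as JavaScript changes which variable is referenced.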