Converting HTML tags to upper case

Bern has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Converting HTML tags to upper case by marto (Cardinal) on Dec 11, 2006 at 10:30 UTC
Hi Bern, What have you tried so far? Do you have any working code, or is there a particular part of this problem that you need help with? You say that you only have basic Perl knowledge; the Tutorials section of this site is chock full of useful information. You should probably check out Ovid's HTML::TokeParser::Simple module. I am sure that if you Super Search this site, you will be able to find plenty of example code that does what you want. However, your question looks like you have posted a specification for a program you need to write, and are waiting for someone to do it for you. Update: In addition to the advice I gave you in response to your previous post, Table Help, please read How do I compose an effective node title? Martin
Re: Converting HTML tags to upper case by cdarke (Prior) on Dec 11, 2006 at 10:40 UTC
When you are starting out with any language you can get "blank screen" syndrome. The trick is to break the problem down into small chunks. I suppose I could write the script for you, but instead I'm going to suggest an approach you might like to take so you can begin learning more of the language. You have a fairly clear spec on what you need, so start by taking that spec as a set of comments. Now "fill in the blanks" using your research and the online documentation, always remembering that "There's more than one way to do it":

    # prompt the user to enter the name of a HTML file.
    Use print, then read from STDIN

    # The file will need verifying to check that the file extension is
    # .html or .htm (upper case or lower case is fine).
    Use a regular expression or string comparison

    # Upon input of the validated name the file needs to be processed
    Open the file
    Read each record in a loop

    # and all lowercase tags, e.g (<html>) need converting to
    # uppercase (<HTML>),
    # no other text should be converted
    You could craft a regular expression, but also check out the many HTML helper modules on CPAN

    # also tag attribute values need to remain,
    # e.g (<img src="picture.jpg"> should convert to
    # <IMG SRC="picture.jpg"> and not <IMG SRC="PICTURE.JPG">. likewise

    # At the end of the processing, the original file needs to be renamed
    # with the .old extension instead of .htm or .html
    # and the processed file should be given the original file name.
    Generate the new filename, then use rename

Now when you need more specific help with each section you can look up the online documentation and web resources, including SuperSearch.
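As a rough illustration of the first and third chunks of that outline (reading the name, then looping over the file), here is a minimal sketch. The sub names `read_filename` and `count_lines` are made up for this example, and an in-memory filehandle stands in for STDIN so the snippet runs without typing anything. It only shows the plumbing; the tag conversion itself is left out.

```perl
use strict;
use warnings;

# Read one line from a filehandle and strip the trailing newline.
# Without chomp, the "\n" typed by the user stays on the file name.
sub read_filename {
    my ($in) = @_;
    my $name = <$in>;
    return unless defined $name;    # end of input
    chomp $name;
    return $name;
}

# Open a file with three-argument open, check for failure,
# and read it one record (line) at a time.
sub count_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++;                   # real processing of $line would go here
    }
    close $fh;
    return $count;
}

# Demonstration: an in-memory filehandle plays the part of STDIN.
my $typed = "Home.HTML\n";
open my $fake_stdin, '<', \$typed or die "Cannot open in-memory handle: $!";
my $file = read_filename($fake_stdin);
print "Got file name: $file\n";
```

In a real script you would print the prompt and call `read_filename(\*STDIN)` instead of using the in-memory handle.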
Re: Converting HTML tags to upper case by PockMonk (Beadle) on Dec 11, 2006 at 10:53 UTC
Your post seems so very specific that if I was cynical I would suspect this was some sort of assignment rather than a problem you were trying to overcome yourself. If you post up what you've got so far, what you think it should do, and what it actually does do, then people would gladly point out helpful corrections to your code. If you really are very new to Perl, you need to look at "regular expressions". If you don't have a perl book or perl installation with manpages, google search "perl regular expressions". If you're not sure how to tackle this, someone providing you a throwaway solution without you learning anything does you no favours in the long run. Better to learn up on the general principles, have a stab at it, and then get help if your effort doesn't quite work. Have a stab at it, and post up with your results :-)
Re: Converting HTML tags to upper case by ww (Archbishop) on Dec 11, 2006 at 16:34 UTC
Sorry, this is another reply that's not a direct and complete answer. In fact, it's well OffTopic as far as the Perl element of your question goes. But I would recommend, strongly, against uppercasing html tags. Doing so will make switching to XHTML, should the need arise, more difficult, as tags like <HTML> or <P...> or <DIV...> don't comply with the XHTML standard. I'd also point out that blithely converting the original files to .old from either .html or .htm runs a risk of (sequentially) processing 1.htm and 1.html (well, having both in a single system is no wackier, IMO, than the conversion you're seeking to do), which would mean the total loss of the original of 1.htm when the original of 1.html is (subsequently) renamed to the same 1.old in the same dir.
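One way to guard against that collision is to refuse to create a backup when the .old name is already taken. The following is only a sketch under that assumption; `backup_name` and `make_backup` are hypothetical helper names, not part of any module.

```perl
use strict;
use warnings;

# Build the backup name by swapping a trailing .htm/.html for .old.
sub backup_name {
    my ($name) = @_;
    (my $backup = $name) =~ s/\.html?\z/.old/i;
    return $backup;
}

# Rename the original out of the way, but never clobber an earlier backup.
sub make_backup {
    my ($name) = @_;
    my $backup = backup_name($name);
    die "'$name' has no .htm/.html extension\n" if $backup eq $name;
    die "Refusing to overwrite existing '$backup'\n" if -e $backup;
    rename $name, $backup or die "Cannot rename '$name' to '$backup': $!";
    return $backup;
}

# Both names map to the same backup, which is exactly the problem.
print backup_name($_), "\n" for ('1.htm', '1.html');    # prints 1.old twice
```

Dying is the bluntest possible reaction; a friendlier script might pick a different backup name instead.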
Re: Converting HTML tags to upper case by talexb (Chancellor) on Dec 11, 2006 at 18:46 UTC
Silly question, perhaps, but why are you doing this? Is it just an exercise to improve your coding skills? You probably already know that HTML tags are case insensitive. Just curious. Alex / talexb / Toronto "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
Re^2: Converting HTML tags to upper case by Bern (Initiate) on Dec 12, 2006 at 12:54 UTC
I have taken what cdarke has advised on board and tried to break it down into 6 small chunks:

#1 - Prompt the user to enter the filename.
#2 - Verify the file extension to make sure it's htm or html.
#3 - Open the file and read each line of the file.
#4 - Change all tags to uppercase.
#5 - Tag attributes need to remain in lowercase.
#6 - Original file needs renaming with the .old extension and the processed file needs the original filename.

This is what I've got so far (not a great deal of progress).

#1 `Print ("Please enter the name of your html file\n"); $file = (<STDIN>);`

#2 (I have the following regular expression) `($file =~ s/html\|htm$/i).` Would I be best putting this in an IF statement?

#3 & #4 `open (IN, $file); open (OUT, >$file); while ($line = [IN]) { $line =~ s/[html]/[HTML]/;` Would I have to do the above line for all possible tags or is there an easier way? `(print OUT $line); }`

Struggling on part #5!

#6 Got the feeling I'm going to need to use something like the following. `rename $file, "$file.old"; open (IN, "<$file.old"); Open (OUT, ">$file");`
Re^3: Converting HTML tags to upper case by wfsp (Abbot) on Dec 12, 2006 at 13:50 UTC
Here's my stab at #4 and #5. It uses HTML::TokeParser::Simple as suggested by marto above. If there is HTML in the air I won't leave home without it. :-)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $file = 'home.html';
    my $p = HTML::TokeParser::Simple->new($file);

    my $html;
    while (my $t = $p->get_token){
      if ($t->is_start_tag){
        # rebuild the start tag with an upper-case tag name
        my $tag = '<'. uc $t->get_tag;
        my $attr = $t->get_attr;
        my $inline;
        for my $name (keys %{$attr}){
          # a '/' attribute marks a self-closing tag such as <br />
          $inline++, next if $name eq '/';
          # attribute name in upper case; note that lc forces the value
          # to lower case, so drop the lc to keep values exactly as written
          my $name_value = sprintf " %s=\"%s\"", uc $name, lc $attr->{$name};
          $tag .= $name_value;
        }
        $tag .= ' /' if $inline;
        $tag .= '>';
        $html .= $tag;
      }
      elsif ($t->is_end_tag){
        my $tag = sprintf "</%s>", uc $t->get_tag;
        $html .= $tag;
      }
      else{
        # text, comments and everything else pass through untouched
        $html .= $t->as_is;
      }
    }
    print $html;
Re^3: Converting HTML tags to upper case by talexb (Chancellor) on Dec 12, 2006 at 14:35 UTC
Some quick comments ..

In the UNIX/Linux world that Perl mostly lives in, it is more likely that the file name you'll be working on will be passed in as an argument, rather than supplied interactively in response to a prompt. This allows us to do clever things like find the files we want to work on (perhaps by using `ls` and `xargs`) and then call your utility.

Figure out a plan for what your file names are going to be. Typically you would do something like this: the original file is renamed by adding '.org' or '.orig' to it (for original), and a new file is created with the original file name. Alternatively, '.org' can be '.bak' (for backup).

You asked "Would I have to do the above line for all possible tags or is there an easier way", and the answer is yes, there is an easier way: use one of the modules suggested to you already. Parsing HTML with a regular expression is a tempting challenge, but it's a fool's errand. That means don't try to do it unless you want a Greek chorus chanting "Don't do that!" when you come and ask for assistance.

The usual UNIX/Linux command for renaming is `mv` (move), and if you're going to be copying, moving or renaming files from Perl, best use File::Copy rather than rolling your own, for the following reasons:

- It's tried and true code.
- It works on multiple platforms.
- It will handle all of the weird cases that you never thought of but that will wake you at 4am if your homebrew code gets installed on a Production machine and it breaks in the worst possible way.
- You won't have a Greek chorus chanting "Use CPAN" all the time. Actually, File::Copy is a core module, so you don't even have to go to CPAN for it.

There's a lot to learn, so search for articles here on Perlmonks and do lots of reading. 99% of the time, the thing you want to do has already been thought of and coded up. It's amusing to reinvent the wheel, but only do so if you have plenty of time to learn.
Alex / talexb / Toronto "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
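To make the argument-passing and File::Copy suggestions concrete, here is a small sketch. It assumes the '.orig' naming plan described above; `keep_original` is a hypothetical helper name, while `move` is the function File::Copy exports.

```perl
use strict;
use warnings;
use File::Copy qw(move);

# Rename $file to "$file.orig" and return the backup name.
sub keep_original {
    my ($file) = @_;
    my $backup = "$file.orig";
    move($file, $backup) or die "Cannot move '$file' to '$backup': $!";
    return $backup;
}

# Each command-line argument is treated as a file to process.
for my $file (@ARGV) {
    my $backup = keep_original($file);
    print "$file saved as $backup\n";
    # the converted output would then be written to $file
}
```

Saved under a name of your choosing (say `upcase.pl`, purely an example), it could be called as `ls *.html | xargs perl upcase.pl`. With no arguments it does nothing.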
Re^3: Converting HTML tags to upper case by ww (Archbishop) on Dec 12, 2006 at 14:35 UTC
Bern: In keeping with your description of your experience, here's a rather basic trick you may find useful.

Re #2 - One good way to speed your learning is to run such constructs through a "try-out." Here's a fairly simple way to do so (using one specific bit of your proposed code):

    my @files = ("foo.htm","two.htm","three.HTM","FOUR.HTML");
    # etc etc for as many variants as you like
    foreach $file(@files) {
      if ($file =~ s/html\|htm$/i) {
        print $file;
      }
    }

(What's happening above is that we're stuffing a variety of possible names into an array, which makes for an easy, compact way to check them all.)

and then tell perl to run a check ( `-c` ):

    >perl -c bern.pl
    Substitution replacement not terminated at bern.pl line 6.
    >

Now, Perl has told you there's something wrong with that scheme. (Hint: You've said you're checking to make sure the file has an .htm or .html extension. So why are you using substitution ( `s///` )?) Perhaps that's an "aha" moment. We don't want to substitute in the test (and yes, we do need that `if` in order to test without being sensitive to case).

So what happens if we edit the script to test for a MATCH instead of attempting substitution (and drop the backslash before the `|`, since `\|` would match only a literal pipe character)?

    my @files = ("foo.htm","two.htm","three.HTM","FOUR.HTML");
    # etc etc for as many variants as you like
    foreach $file(@files) {
      if ($file =~ /html|htm$/i) {
        print $file;
      }
    }

Now, it passes the check... and running the script does this:

    >perl bern.pl
    foo.htmtwo.htmthree.HTMFOUR.HTML
    >

Well, ugly, but "yep" -- all four bits of test data matched either htm or html. So we're done, right?

*BRRRRRRRRAATTTTTT!* NO! Let us suppose some evil user (and if you don't think "evil user" is redundant, beware!) tried to foist a file like "oneHTML.xyz" on you? Well, try it!

(...short pause while you do so)

OK, so now we know the test works in part, but not well enough to do what you want -- that is, not well enough to restrict the acceptable files to those with an extension of .htm, .html, .HTM or .HTML (though your trailing "i" does provide the case insensitivity you probably want in the uploaded filename.)

That means it's time to read some more; say, for example, `perldoc perlretut` or one of the many nodes here on regular expressions.

And then, just for the record, the advice you'll see frequently in what you read here, "Don't use regexen to parse html," refers to the NEXT step of your journey. Using a regex is just fine (at least IMO) to test a filename.
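For readers who want to compare notes after that reading, one possible tightening of the test is sketched below. It assumes the goal is simply "name ends in a dot followed by htm or html, in any case"; `is_html_name` is a made-up helper name. The escaped dot and the `\z` anchor tie the match to the very end of the name.

```perl
use strict;
use warnings;

# True when the name ends in .htm or .html, ignoring case.
#   \.     a literal dot
#   html?  "htm" with an optional "l"
#   \z     the very end of the string
sub is_html_name {
    my ($name) = @_;
    return $name =~ /\.html?\z/i ? 1 : 0;
}

my @files = ("foo.htm", "two.htm", "three.HTM", "FOUR.HTML", "oneHTML.xyz");
for my $file (@files) {
    printf "%-12s %s\n", $file, is_html_name($file) ? "accepted" : "rejected";
}
```

The four names from the try-out are accepted and "oneHTML.xyz" is rejected. Because `\z` does not ignore a trailing newline, a name read from STDIN needs a `chomp` before it is tested.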