Re: Converting HTML tags into uppercase using Perl
by davorg (Chancellor) on Nov 29, 2005 at 11:04 UTC
|
It would be really simple to knock up something that did this using HTML::Parser, but it's perhaps worth pointing out that if you are at all interested in XHTML compatibility then valid XHTML tags are all lower case.
Update: Here's a basic HTML::Parser solution. It can almost certainly be improved and/or simplified.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my $p = HTML::Parser->new(start_h => [\&start, 'tagname, attr, attrseq'],
                          end_h   => [\&end,   'tagname'],
                          text_h  => [\&text,  'text']);
$p->parse_file(shift);

sub start {
    my ($name, $attr, $attrseq) = @_;
    print '<' . uc($name);
    if (keys %$attr) {
        foreach (@$attrseq) {
            print ' ' . uc($_) . '="' . $attr->{$_} . '"';
        }
    }
    print '>';
}

sub end {
    print '</' . uc($_[0]) . '>';
}

sub text {
    print $_[0];
}
--
< http://dave.org.uk>
"The first rule of Perl club is you do not talk about
Perl club." -- Chip Salzenberg
| [reply] [d/l] |
|
|
I've tried this, but I can't get it to register the filename after I've entered it. Any ideas?
#!/usr/bin/perl
use warnings;
use HTML::Parser;

print("Enter an html file (with either a .html or .htm extension): ");
$file=<STDIN>;
my $file = $ARGV[0];
unless ($file) {
    print ("No filename given\n");
    exit;
}

my $new;
my $p = HTML::Parser->new(
    start_h   => [ \&start_h, 'tagname, text' ],
    end_h     => [\&end_h, 'tagname, text' ],
    default_h => [sub { $new .= shift }, 'text'],
);
$p->parse_file($file);

# Rename the old file
my $newfile = $file.'.old';
rename($file, $newfile) or die "Can't rename $file: $!";

# Write the new text to the old filename
open my $fh, ">", $file or die "Can't create new file: $!";
print $fh $new;
close $fh;

sub start_h {
    my($tag, $text) = @_;
    my $uc = uc $tag;
    $text =~ s/$tag/$uc/;
    $new .= $text;
}

sub end_h {
    my($tag, $text) = @_;
    my $uc = uc $tag;
    $text =~ s/$tag/$uc/;
    $new .= $text;
}
| [reply] [d/l] |
|
|
$file=<STDIN>;
my $file = $ARGV[0];
This looks pretty confused to me. You read the filename from STDIN into a package variable called $file (incidentally, you don't chomp that value so it still has a newline character on the end). You then ignore that value and create a new, lexical, variable also called $file and into that you copy the value of the first command line argument. You don't say how you call the program, but if you don't give it any command line arguments then that will be 'undef'. You then ignore the package variable (which has the correct value - albeit with an extra newline) and continue to use the lexical value which (probably) contains 'undef'.
So, no, it almost certainly won't do what you want :)
This is a good example of why you should always have use strict in your programs.
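A minimal sketch of that point, using only core Perl. The helper name strict_error and the sample filename are my own assumptions for illustration; the snippet just compiles the two quoted lines with strict enabled and reports what Perl says, then shows what chomp does to a value as <STDIN> would return it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compile a snippet with strict enabled and return
# the compile error, or an empty string if the snippet compiled.
sub strict_error {
    my ($code) = @_;
    my $ok = eval "use strict; $code; 1";
    return $ok ? '' : $@;
}

# The two lines from the question. The first assignment targets an
# undeclared package variable; the second declares a new lexical.
# Compilation fails under strict, so STDIN is never actually read.
my $err = strict_error(q{$file = <STDIN>; my $file = $ARGV[0]});
print "strict says: $err" if $err;

# chomp removes the trailing newline that <STDIN> leaves on the value.
my $typed = "page.html\n";    # assumed sample of what <STDIN> returns
chomp(my $name = $typed);
print "cleaned name: '$name'\n";
```

With strict in place the mistake is reported at compile time instead of silently producing an undefined filename.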
You probably want to write that code something like this (untested):
# check to see if you have a command line argument
my $file = $ARGV[0];
# if not, or if it's not an HTML file, then prompt for one
until ($file && ($file =~ /\.html?$/i)) {
    print('Enter an html file (with either a .html or .htm extension): ');
    $file=<STDIN>;
    chomp $file;
}
--
< http://dave.org.uk>
"The first rule of Perl club is you do not talk about
Perl club." -- Chip Salzenberg
| [reply] [d/l] |
|
|
Re: Converting HTML tags into uppercase using Perl
by holli (Abbot) on Nov 29, 2005 at 11:18 UTC
|
The following uses HTML::TokeParser and should give you a starting point:
use strict;
use warnings;
use HTML::TokeParser;

my $p = HTML::TokeParser->new( "file.html" );
while ( my $t = $p->get_token )
{
    #forward comments, text and declarations
    if ( $t->[0] =~ /[CDT]/ )
    {
        print $t->[1];
    }
    #uppercase start tags
    elsif ( $t->[0] =~ /S/ )
    {
        print
            "<",
            uc($t->[1]),
            " ",
            join (" ", map { uc($_) . '="' . $t->[2]->{$_} . '"' } @{$t->[3]}),
            ">";
    }
    #uppercase end tag
    elsif ( $t->[0] =~ /E/ )
    {
        print uc($t->[2]);
    }
    #forward processing instruction
    elsif ( $t->[0] =~ /PI/ )
    {
        print $t->[2];
    }
}
| [reply] [d/l] |
Re: Converting HTML tags into uppercase using Perl
by planetscape (Chancellor) on Nov 30, 2005 at 01:40 UTC
|
| [reply] [d/l] |
OT Re: Converting HTML tags into uppercase using Perl
by ww (Archbishop) on Nov 29, 2005 at 12:24 UTC
|
...and, while uppercase tags are allowed under HTML 4.01, they are NOT allowed in XHTML (originally written as "xml"; slap ww upside the head!)... so if this is other than homework, steve_g50 may wish to learn a bit more about HTML as well as about Perl.
Update: grinder is, of course, correct both re XML and re the need to provide cites, and Fletch, thanks! Your cite is bang on.
Moral (and message to self): ensure caffeine levels are within normal operating range and put brain in gear before typing.
| [reply] |
|
|
uppercase tags [...] are NOT allowed in xml
ww may wish to learn more about XML, or at least be able to quote the specification chapter and verse in order to back up such a claim. I've been doing XML for years (and SGML before that) and I've never heard of such nonsense.
A start-tag is a Name, and a Name is one or more Letters (more or less, ignoring namespace issues), and a Letter may be drawn from many, many things, including, but not limited to, uppercase and lowercase letters.
See the section on logical structures in the XML specification for more information.
Update: my bad, I did ponder how ww could have come up with such an outlandish idea (because his/her advice is spot-on in general), and I failed to make the connection to XHTML. I just wanted to quash the meme before it got any further.
• another intruder with the mooring in the heart of the Perl
| [reply] |
|
|
I think he misspoke and meant "XHTML" rather than XML. While you are correct that XML allows upper-, lower-, and mixed-case tag names, the XHTML spec does specifically require lowercase:
4.2. Element and attribute names must be in lower case
XHTML documents must use lower case for all HTML element and attribute names. This difference is necessary because XML is case-sensitive e.g. <li> and <LI> are different tags.
http://www.w3.org/TR/xhtml1/#h-4.2
| [reply] |
Re: Converting HTML tags into uppercase using Perl
by mlh2003 (Scribe) on Nov 30, 2005 at 12:01 UTC
|
| [reply] |
Re: Converting HTML tags into uppercase using Perl
by Samy_rio (Vicar) on Nov 29, 2005 at 11:03 UTC
|
#!/usr/bin/perl -w
use strict;
local $/;
open(INPUT, "input.html") || die("$!");
open(OUTPUT, ">output.html") || die ("$!");
my $txt = <INPUT>;
$txt=~s/<([^> ]*)([^>]*>)/"<".uc($1)."$2"/egsi; #Updated: only element names except attributes
print OUTPUT $txt;
Regards, Velusamy R. eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';
| [reply] [d/l] [select] |
|
|
<html>
<img src="a_greater_b.gif" alt="a > b" />
<img src="a_smaller_b.gif" alt="a < b"/>
</html>
These niggles are the reason why it is always recommended to avoid parsing HTML with regular expressions.
Update: Rearranged HTML to be a test case for the second problem as well.
| [reply] [d/l] |
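As a rough illustration of the kind of niggle meant here, this sketch wraps the substitution from the parent post in a sub and runs it over three inputs. The sub name uc_tags_regex and all three inputs are my own assumptions, not the test case from this reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The substitution from the parent post, wrapped in a sub
# (the sub name is hypothetical).
sub uc_tags_regex {
    my ($txt) = @_;
    $txt =~ s/<([^> ]*)([^>]*>)/"<".uc($1)."$2"/egsi;
    return $txt;
}

print uc_tags_regex('<p class="x">hi</p>'), "\n";
# prints: <P class="x">hi</P>

# Markup inside a comment is not a tag, yet part of it is rewritten.
print uc_tags_regex('<!-- <b>old</b> -->'), "\n";
# prints: <!-- <b>old</B> -->

# A '>' inside an attribute value ends the match early, so the rest
# of the value is treated as a new tag.
print uc_tags_regex('<img alt="a ><b">'), "\n";
# prints: <IMG alt="a ><B">
```

The first input comes out as intended; the other two show comment text and an attribute value being changed, which a real parser would leave alone.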
|
|
Thanks Samy_rio, I'll see what I can do with your program.
Thank you to everyone else for your help.
Keep replying if you can still help with my problem.
Merry Christmas to all of you who celebrate it.
| [reply] |
|
|
This works very well, thanks. But how do I get the program to ASK for a .htm or .html file, then change the tags in the given file to uppercase, and THEN save the new file using the OPEN function?
Thanks again.
| [reply] |
Re: Converting HTML tags into uppercase using Perl
by kulls (Hermit) on Nov 29, 2005 at 13:46 UTC
|
Basically, why do you need to do this? May I know the details? That would help us solve the problem in a better way.
-kulls
| [reply] |
Re: Converting HTML tags into uppercase using Perl
by inman (Curate) on Nov 29, 2005 at 11:21 UTC
|
while (<>)
{
s/(<\s*\/?\s*)(\w+)/$1\U$2/g;
print;
}
| [reply] [d/l] |
|
|
See, this is why you should never try to parse arbitrary HTML with regular expressions. Your regex doesn't handle a number of very common occurrences. The first thing that springs to mind is tags with attributes - the tag name will be upper-cased, but the attribute names will be left untouched. The original poster was unclear as to what should be done in those circumstances.
Also can you be sure that every < character in the document starts a tag? What if it was in a CDATA section?
All in all, I think it's far better to use an HTML parser. They are there to be used, so why not use them?
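A small sketch of both objections, using the one-line substitution from the parent post wrapped in a sub. The sub name uc_tags_oneliner and the two sample strings are assumptions made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The substitution from the parent post, wrapped in a sub
# (the sub name is hypothetical).
sub uc_tags_oneliner {
    my ($line) = @_;
    $line =~ s/(<\s*\/?\s*)(\w+)/$1\U$2/g;
    return $line;
}

# Tag names are upper-cased, attribute names are not.
print uc_tags_oneliner('<a href="x.html">link</a>'), "\n";
# prints: <A href="x.html">link</A>

# A '<' inside a CDATA section is not a tag, but it still matches.
print uc_tags_oneliner('<![CDATA[ if (a <b) ]]>'), "\n";
# prints: <![CDATA[ if (a <B) ]]>
```

In the second case the content of the CDATA section itself is altered, which is exactly the sort of thing a parser avoids because it knows which parts of the document are markup.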
--
< http://dave.org.uk>
"The first rule of Perl club is you do not talk about
Perl club." -- Chip Salzenberg
| [reply] |