http://qs1969.pair.com?node_id=321171

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

How would I get a group of @codes to go through my .html file and if it finds anything in @codes, it'll change it into lowercase?

Basically, I'm trying to rewrite my html tags that are all uppercase to all lowercase.. How close am I?

my $file = "index.html";
my @codes = ("<P>", "<BR>", "UL"); # and many more
open(FILE, "> $file") or die "Oops: $!";
while(<FILE>);
if ($_ eq @words) {
    $_ =~ s/@words/lc @word/g;
}
close(FILE);
Thanks.

Replies are listed 'Best First'.
Re: substuting a whole file
by ysth (Canon) on Jan 14, 2004 at 03:36 UTC
    First of all, unless you use a real HTML parser, you won't handle all HTML correctly this way. But if you really want to do it by hand, I'd do something like this:
    use strict;
    use warnings;
    my $file = "index.html";
    my @codes = ("<p>", "<BR>", "<UL>");
    my $codes_regex = join "|", map quotemeta $_, sort { length $b <=> length $a } @codes;

    # slurp the file
    open my $in, "<", $file or die "couldn't open $file: $!";
    my $text = do {local $/; <$in>};
    close $in;

    # lower case all the codes in text
    $text =~ s/($codes_regex)/lc $1/gie;

    # write the file
    open my $out, ">", $file or die "couldn't open $file: $!";
    print $out $text;
    close $out;
    The sort is only necessary if some codes are proper substrings of others (e.g. "ab" and "abc"). Putting the longest codes first prevents the shorter alternative "ab" from matching where "abc" should have matched.

    The quotemeta is needed if your codes have characters in them that are special to regexes.
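To make the quotemeta point concrete, here is a minimal sketch. The codes "A.B" and "C+" are made up for illustration (they are not from the original post); they were picked only because "." and "+" are regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical codes, chosen because "." and "+" are special in regexes.
my @codes = ("A.B", "C+");

# Same alternation built twice: once raw, once with quotemeta applied.
my $raw    = join "|", @codes;
my $quoted = join "|", map { quotemeta($_) } @codes;

my $text = "AxB A.B CCC C+";

# Copy $text, then substitute in the copy.
(my $with_raw    = $text) =~ s/($raw)/lc $1/ge;
(my $with_quoted = $text) =~ s/($quoted)/lc $1/ge;

print "raw:    $with_raw\n";
print "quoted: $with_quoted\n";
```

Without quotemeta the pattern treats "." as "any character" and "+" as "one or more", so "AxB" and "CCC" get lowercased too; with quotemeta only the literal codes are touched.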

      UPDATE: I have to agree with the rest of them. For safety reasons (so you don't demolish the test file), you may want to open $file but save to $file2, just in case the unexpected happens.

      I have a few questions or comments about the script you wrote. How exactly isn't this going to treat HTML correctly? All you're doing is taking any text it finds, regardless of what characters it contains, and trying to put it into lowercase. You're not treating HTML, you're treating text. I tested your script out with <A HREF= in the @codes and it worked fine. It's not interpreting the file as HTML at all, so no matter what you throw in there (thanks to quotemeta), it'll do its job.

      My question to you was, what exactly is line 5 doing with the joining, mapping and sorting? You're playing with length, which I thought only returned the length in characters of the item you're using it with.



      "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

      sulfericacid
        UPDATE: I have to agree with the rest of them. For safety reasons (so you don't demolish the test file), you may want to open $file but save to $file2, just in case the unexpected happens.
        I agree, and will update my node to do so.
        How exactly isn't this going to treat HTML correctly? ... It's not interpreting the file as HTML at all
        That's all I meant; it won't look for HTML tags, it will look for literal text, including what it finds in comments, script, etc.
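A small sketch of what "literal text" means in practice. The input below is invented for illustration; the regex is built the same way as in the script above.

```perl
use strict;
use warnings;

my @codes = ("<P>", "<BR>");
my $codes_regex = join "|", map { quotemeta($_) }
                  sort { length $b <=> length $a } @codes;

# Hypothetical input: the second and third lines contain the codes
# inside a comment and inside a script string, not as real markup.
my $text = join "\n",
    '<P>Hello<BR>',
    '<!-- use <BR> sparingly -->',
    '<script>var s = "<P>";</script>';

(my $out = $text) =~ s/($codes_regex)/lc $1/gie;
print "$out\n";
```

The substitution has no idea where markup ends and other text begins, so the <BR> in the comment and the "<P>" in the script string are lowercased along with the real tags.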
        My question to you was, what exactly is line 5 doing with the joining, mapping and sorting? You're playing with length, which I thought only returned the length in characters of the item you're using it with.
        Sorting greatest length first ensures that the match will work if you have e.g. "<A " and "<A HREF". Without the sort, you get results like:
        $ perl
        use warnings;
        use strict;
        my @codes = ("<a ", "<a href");
        my $codes_regex = join "|", map quotemeta $_,
        #    sort { length $b <=> length $a }
            @codes;
        my $text = "testing a link: <A HREF=\"fooble.html\">boofle</a>";
        print "in: $text\n";
        $text =~ s/($codes_regex)/lc $1/gie;
        print "out: $text\n";
        __END__

        output with the sort:
        in: testing a link: <A HREF="fooble.html">boofle</a>
        out: testing a link: <a href="fooble.html">boofle</a>

        and without:
        in: testing a link: <A HREF="fooble.html">boofle</a>
        out: testing a link: <a HREF="fooble.html">boofle</a>
        This is because Perl regexes try the |'d alternatives left to right and take the first one that matches, even if it makes a shorter match.

        The map is just to apply the quotemeta; the join is to put | between tags.
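For the "open $file but save to $file2" suggestion discussed above, one possible shape is sketched below. The sub name lowercase_codes and the output name index.lower.html are placeholders of mine, not something from the original node.

```perl
use strict;
use warnings;

# Read $src, lowercase every literal occurrence of @codes, write to $dst.
# $src is only ever opened for reading, so a bad run cannot clobber it.
sub lowercase_codes {
    my ($src, $dst, @codes) = @_;
    my $codes_regex = join "|", map { quotemeta($_) }
                      sort { length $b <=> length $a } @codes;

    open my $in, "<", $src or die "couldn't open $src: $!";
    my $text = do { local $/; <$in> };   # slurp the whole file
    close $in;

    $text =~ s/($codes_regex)/lc $1/gie;

    open my $out, ">", $dst or die "couldn't open $dst: $!";
    print $out $text;
    close $out or die "couldn't close $dst: $!";
    return;
}

# Example call; both file names are placeholders.
lowercase_codes("index.html", "index.lower.html", "<P>", "<BR>", "<UL>")
    if -e "index.html";
```

Once the output looks right, the new file can be renamed over the old one by hand.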

Re: substuting a whole file
by Roger (Parson) on Jan 14, 2004 at 03:21 UTC
    Hate to say this, but it's not close. Looks like you have yet to grasp some of the Perl basics.

    I would recommend the great utility HTML Tidy for doing this kind of thing instead of rolling your own.

Re: substuting a whole file
by jweed (Chaplain) on Jan 14, 2004 at 03:35 UTC
    Update:
    Roger pointed out some gross errors. Sorry! Fixed now!

    Well, other than a few errors, you're on your way. I'll do a line by line on this baby and see what I think.

    my $file = "index.html";
    my @codes = ("<P>", "<BR>", "UL"); # and many more
    Do you mean to be inconsistent here? Or should "UL" be "<UL>"?
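    open(FILE, "> $file") or die "Oops: $!";
    while(<FILE>);
    Your open uses ">", which truncates (clobbers) index.html before you ever read it. Not what you want. Also, you never write anything back out, so the file's contents would never actually be changed.
    The semicolon right after while(<FILE>) also ends the loop on the spot. Try  while (<FILE>) { :)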
    if ($_ eq @words) { $_ =~ s/@words/lc @word/g; }
    Here's where you have your real problems. The if statement compares the file line to the number of items in the array, because eq puts the array in scalar context (and you wrote @words where you presumably meant @codes). Probably not what you want. Also, eq won't work unless you have only one word per line. Also probably not what you want.
    close(FILE);
    Way to go.
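The scalar-context point in the walkthrough above (comparing a line to an array with eq) is easy to check. A minimal sketch, using @codes in place of the undeclared @words:

```perl
use strict;
use warnings;

my @codes = ("<P>", "<BR>", "<UL>");
my $line  = "<P>";

# eq imposes scalar context on @codes, which yields its element count (3),
# so the comparison is against the string "3", not against the elements.
my $line_matches  = ($line eq @codes) ? "true" : "false";
my $count_matches = ("3"   eq @codes) ? "true" : "false";

print "'<P>' eq \@codes: $line_matches\n";
print "'3'   eq \@codes: $count_matches\n";
```

So the test only succeeds when the line happens to be the element count as a string, which is never what the original code was after.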

    I would rewrite it like this:
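    #!/usr/bin/perl -i
    use warnings;
    use strict;

    my @codes = ("<P>", "<BR>", "<UL>"); # and many more

    # -i edits the file named on the command line in place;
    # whatever is printed becomes the new contents.
    while (<>) {
        for my $code (@codes) {
            # \Q...\E quotes regex metacharacters; /e evaluates lc $code
            s/\Q$code\E/lc $code/ge;
        }
        print;
    }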
    Call it with the filename as argument.
    Of course, this is still horribly simple minded because it breaks on constructs like <P class="hi">.

    Good luck fixing that with a simple regex.
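For what it's worth, a regex can get somewhat further than a fixed list of codes if it targets the tag name itself. This is only a sketch of that idea, not a fix: it assumes tag names start with a letter and contain only letters and digits, it leaves attribute names and values alone, and like every approach in this thread it still cannot tell comments or script contents from real markup.

```perl
use strict;
use warnings;

# Made-up sample input with attributes on the tags.
my $text = '<P CLASS="hi">One<BR>two</P> <A HREF="Foo.html">x</A>';

# Match "<" or "</" followed by a tag name, and lowercase only the name.
# $1 is the optional slash, $2 is the tag name.
(my $out = $text) =~ s{<(/?)([A-Za-z][A-Za-z0-9]*)}{"<$1" . lc $2}ge;

print "$out\n";
```

On the sample input this lowercases P, BR and A while CLASS and HREF stay as they were. A real parser or HTML Tidy remains the safer route.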



    Code is (almost) always untested.
    http://www.justicepoetic.net/
      open(FILE, "> $file") or die "Oops: $!";

      That will clobber the index.html file. Not a good idea I am afraid.