PerlMonks  

Re: substuting a whole file

by ysth (Canon)
on Jan 14, 2004 at 03:36 UTC ( [id://321177]=note )


in reply to substuting a whole file

First of all, unless you use a real HTML parser, this approach will not handle all HTML correctly. But if you really want to do it by hand, I'd do something like this:

    use strict;
    use warnings;
    my $file = "index.html";
    my @codes = ("<p>", "<BR>", "<UL>");
    my $codes_regex = join "|", map quotemeta $_,
        sort { length $b <=> length $a } @codes;

    # slurp the file
    open my $in, "<", $file or die "couldn't open $file: $!";
    my $text = do {local $/; <$in>};
    close $in;

    # lower case all the codes in text
    $text =~ s/($codes_regex)/lc $1/gie;

    # write the file
    open my $out, ">", $file or die "couldn't open $file: $!";
    print $out $text;
    close $out;

The sort is only necessary if some codes begin with other codes (e.g. "ab" and "abc"); it puts the longer codes first, so that "abc" in the text isn't matched as just "ab".

The quotemeta is needed if your codes have characters in them that are special to regexes.
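
As a small illustration of why the quotemeta matters, here is a sketch using a made-up code that contains regex metacharacters. The code string and sample text are hypothetical, not taken from the original question:

```perl
use strict;
use warnings;

# Hypothetical code containing "." and "?", both special in regexes.
my $code = '<A HREF="x.cgi?id=1">';
my $text = '<A HREF="x.cgi?id=1">link</A>';

# Unquoted: "i?" means "an optional i", so the literal "?" in the text
# is never matched and the pattern fails on the string it was built from.
my $raw_matches = $text =~ /$code/ ? 1 : 0;

# Quoted: quotemeta backslash-escapes every non-word character,
# so the pattern matches the code literally.
my $quoted         = quotemeta $code;
my $quoted_matches = $text =~ /$quoted/ ? 1 : 0;

print "raw: $raw_matches, quoted: $quoted_matches\n";
```

With plain tags like "<p>" the unquoted version happens to work, which is why the problem is easy to miss until a code with "?", ".", "(" or similar shows up.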

Replies are listed 'Best First'.
Re: Re: substuting a whole file
by sulfericacid (Deacon) on Jan 14, 2004 at 18:17 UTC
    UPDATE: I have to agree with the rest of them. For safety reasons (so you don't demolish the test file), you may want to open $file but save to $file2, just in case the unexpected happens.

    I have a few questions or comments about the script you wrote. How exactly isn't this going to treat HTML correctly? All you're doing is taking any text it finds, regardless of what characters they are, and trying to put it into lowercase. You're not treating HTML, you're treating text. I tested your script out with <A HREF= in the @codes and it worked fine. It's not interpreting the file as HTML at all, so no matter what you throw in there (thanks to quotemeta), it'll do its job.

    My question to you was, what exactly is line 5 doing with the joining, mapping and sorting? You're playing with length which I thought only stored the length in characters of the item you're using it with.



    "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

    sulfericacid
      UPDATE: I have to agree with the rest of them. For safety reasons (so you don't demolish the test file), you may want to open $file but save to $file2, just in case the unexpected happens.
      I agree, and will update my node to do so.
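
A minimal sketch of what that could look like, assuming a hypothetical helper named lowercase_codes and a second filename (index.new.html) that are purely illustrative. The demo writes into a throwaway directory so it can be run without touching a real index.html:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: read $src, lower-case the listed codes, write to $dst.
sub lowercase_codes {
    my ($src, $dst, @codes) = @_;
    my $codes_regex = join "|", map quotemeta $_,
        sort { length $b <=> length $a } @codes;

    # slurp the source file
    open my $in, "<", $src or die "couldn't open $src: $!";
    my $text = do { local $/; <$in> };
    close $in;

    $text =~ s/($codes_regex)/lc $1/gie;

    # write to a separate file so the original is left untouched
    open my $out, ">", $dst or die "couldn't open $dst: $!";
    print $out $text;
    close $out or die "couldn't close $dst: $!";
    return $text;
}

# Demo in a temporary directory (removed automatically at exit).
my $dir = tempdir(CLEANUP => 1);
my ($file, $file2) = ("$dir/index.html", "$dir/index.new.html");
open my $fh, ">", $file or die "couldn't open $file: $!";
print $fh "<P>one<BR>two\n";
close $fh;

my $result = lowercase_codes($file, $file2, "<p>", "<BR>", "<UL>");
print $result;
```

Once the output in $file2 looks right, it can be renamed over the original by hand or with Perl's built-in rename.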
      How exactly isn't this going to treat HTML correctly? ... It's not interpreting the file as HTML at all
      That's all I meant; it won't look for HTML tags, it will look for literal text, including what it finds in comments, script, etc.
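
To make that concrete, here is a small sketch on made-up input where "<BR>" appears inside a comment, inside a script string, and as a real tag. The sample text is hypothetical:

```perl
use strict;
use warnings;

# Made-up sample: "<BR>" in a comment, in a script string, and as a tag.
my $text = join "\n",
    '<!-- note: <BR> is deprecated -->',
    q{<script>var s = '<BR>';</script>},
    'line one<BR>line two',
    '';

my $codes_regex = quotemeta "<BR>";

# s///g returns the number of substitutions it made.
my $count = ($text =~ s/($codes_regex)/lc $1/gie);

print "$count replacements\n$text";
```

All three occurrences are lower-cased, because the substitution sees only text and has no notion of where a tag actually is.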
      My question to you was, what exactly is line 5 doing with the joining, mapping and sorting? You're playing with length which I thought only stored the length in characters of the item you're using it with.
      Sorting greatest length first ensures that the match will work if you have e.g. "<A " and "<A HREF". Without the sort, you get results like:

          $ perl
          use warnings;
          use strict;
          my @codes = ("<a ", "<a href");
          my $codes_regex = join "|", map quotemeta $_,
          #    sort { length $b <=> length $a }
              @codes;
          my $text = "testing a link: <A HREF=\"fooble.html\">boofle</a>";
          print "in: $text\n";
          $text =~ s/($codes_regex)/lc $1/gie;
          print "out: $text\n";
          __END__

          output with the sort:
          in: testing a link: <A HREF="fooble.html">boofle</a>
          out: testing a link: <a href="fooble.html">boofle</a>

          and without:
          in: testing a link: <A HREF="fooble.html">boofle</a>
          out: testing a link: <a HREF="fooble.html">boofle</a>

      This is because Perl regexes try the |'d alternatives from left to right and take the first one that matches, even if a later alternative would make a longer match.

      The map is just to apply the quotemeta; the join is to put | between tags.
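
The same line can be unrolled into separate steps to see what each part contributes. This is only a sketch with illustrative variable names (@sorted, @quoted), using the two codes from the example above:

```perl
use strict;
use warnings;

my @codes = ("<a ", "<a href");

# Step 1: sort longest first, so "<a href" is tried before "<a ".
my @sorted = sort { length $b <=> length $a } @codes;

# Step 2: map applies quotemeta to each element, escaping non-word characters.
my @quoted = map { quotemeta($_) } @sorted;

# Step 3: join puts "|" between the pieces to form one alternation.
my $codes_regex = join "|", @quoted;

print "$codes_regex\n";
```

The printed pattern has the longer alternative on the left, which is what lets "<A HREF" be lower-cased as a whole.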
