I was curious to see if this could be done in a one-liner. The following is untested.
Though one shouldn't rely on regexps to parse HTML, it's tempting in this narrowly defined situation to resort to a one-liner:
perl -pi -00 -e 's{(<TITLE.*?>).*?(<\/TITLE>)}[$1newtitle$2]gis' *.html
The premise is that the -p option wraps a
while(<>) { ..... }
loop around the program and prints each record after the code has run. <> reads the files listed on the command line (in this case *.html).
And the -i switch tells the program to edit the files in place.
The -00 command line switch puts Perl in paragraph mode (the same as setting $/ to ""), so the program reads the file in chunks separated by one or more blank lines instead of line by line. The choice is somewhat arbitrary; -0777 would slurp each file whole instead. The point is to catch situations where the two tags being matched span multiple lines. Paragraph mode will still get tripped up if a blank line (two or more newlines in a row) falls between the tags.
And the /s modifier on the substitution tells the regexp engine to let . match newlines like any other character. The /g modifier tells the engine to look for all occurrences. That may not be necessary if you know that the HTML doesn't have more than one set of <title></title> tags. And the /i modifier tells the regexp engine to treat <Title>, <tItLe>, and so on, all the same (ignore case). The .*? inside of the <title> tag allows for whitespace, attributes, and other things within the opening tag. And the .*? between tags is a non-greedy match of the text between the title tags.
The shortcomings will be: Any <title>....</title> that contains two or more newlines embedded anywhere between the tags will result in the match failing and the title not being replaced. Nested title tags would also be a problem. And title tags (even if somehow escaped) within the title would also throw things off track. And I'm sure that there are others. But if you want quick, dirty, and if you know that the HTML doesn't have multiple newlines next to each other within the title tags, this may do the trick.
To illustrate the effectiveness of this technique, I used the following one-liner to convert the #!/perl/bin/perl "shebangs" in all Perl scripts in one directory on my Windows PC to #!/usr/bin/perl "shebangs" so that I could use those scripts on my Linux PC.
perl -pi -e 's{^#!\/.*?$}[#!\/usr\/bin\/perl];' *.pl
At first I was thinking that using s{...}[...] instead of s/// would eliminate the need to escape the /, but after trying it without escaping the slash, I found that I still had to do so for it to work. But nevertheless, one quick one-liner converted dozens of files' shebang lines for me. With just a little more creativity I could also have preserved any shebang-line perl flags such as -w (warnings). Though with these particular files that was unnecessary.
Dave
"If I had my life to do over again, I'd be a plumber." -- Albert Einstein
In reply to Re: Rename html page titles by davido, in thread Rename html page titles by Concept99