Concept99 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to find a way to mass rename all of the PAGE titles in my html documents. As of now, I have found the following code that lists all the html files and their titles, but I have been having trouble opening the html files for append. Once I do that, I am still unsure of how to change the TITLE tag.
#!/usr/local/bin/perl
foreach $file (<*.htm*>) {
    open(HTFILE, ">>$file") or die ("can't open $file");
    while (<HTFILE>) {
        chop;
        $notitle = 1;
        if (/^\<TITLE\>/) {
            s/<TITLE>//;
            s/<\/TITLE>//;
            print;
            $notitle = 0;
            last; # break out of this loop (file)
        }
        if (/^\<title\>/) {
            s/<title>//;
            s/<\/title>//;
            print;
            $notitle = 0;
            last; # break out of this loop (file)
        }
    }
    if ($notitle == 1) { print "*** No title found *** " };
    close(HTFILE);
    print " ($file)\n";
}

Any ideas?

Replies are listed 'Best First'.
Re: Rename html page titles
by Ovid (Cardinal) on Sep 03, 2003 at 19:25 UTC

    This code should get you started.

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser::Simple;

    my $parser    = HTML::TokeParser::Simple->new(\*DATA);
    my $new_title = 'Some title';
    my $html      = '';
    while (defined (my $token = $parser->get_token)) {
        if ($token->is_start_tag('title')) {
            $html .= $token->as_is . $new_title;
            $token = $parser->get_tag('/title'); # advance to last title tag
        }
        $html .= $token->as_is;
    }
    print $html;
    __DATA__
    <html>
    <head>
    <title> This is a title </title>
    </head>
    <body>
    <p>This is the body</p>
    </body>
    </html>

    Cheers,
    Ovid

    New address of my CGI Course.

      I've been cruising these forums long enough to spot a question that's going to get a reply of "use a module." merlyn beat me to it and Ovid actually showed it. As a writer of fairly utilitarian Perl code (simple forms processing and MySQL for dynamic pages) I'm a little nervous about all this module talk. Having said that, I'm using and loving HTML::Template and GD.

      If anyone has a suggestion of where to go to get/read more elementary stuff on the scary modules (Programming Perl is too scant--more on how to create them), and which ones do what and simple tutorials on how to use them, we module neophytes would stop posting questions about parsing HTML.

      As one still new to the monastery, I'm open.

      Thanks, module-savvy monks!
        If anyone has a suggestion of where to go to get/read more elementary stuff on the scary modules (Programming Perl is too scant--more on how to create them), and which ones do what and simple tutorials on how to use them, we module neophytes would stop posting questions about parsing HTML.
        Well, I have over 150 suggestions for you. Use the search box at the bottom of any page there to narrow it down a bit to a particular topic area.

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

•Re: Rename html page titles
by merlyn (Sage) on Sep 03, 2003 at 19:39 UTC
    Besides the other useful suggestions I've already seen here, I've not seen anybody yet say "use this as an opportunity to never have to do it again—install a templating system now!".

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

Re: Rename html page titles
by tcf22 (Priest) on Sep 03, 2003 at 19:21 UTC
    If <TITLE> and </TITLE> are always on the same line, this would work:

    if (/<title>(.+)<\/title>/i) {
        print $1;
        $notitle = 0;
        last;
    }

    As for opening the file for appending, that probably isn't what you want to do. Open the source file and a temp file. Read from the source file, then write out each line (modified or unmodified) to the temp file. unlink() the source file, and rename the temp file to the original filename.

    foreach my $file (@files) {
        open(TMP, '>/tmp/temp.html') || die "tmp file open failed:$!\n";
        open(FILE, "$file") || die "source file open failed: $!\n";
        while (<FILE>) {
            # replace the title text, leaving every other line untouched
            s/<title>(.+)<\/title>/<title>$new_title<\/title>/i;
            print TMP $_;
        }
        close(TMP);
        close(FILE);
        # swap the rewritten copy in for the original
        unlink($file);
        rename('/tmp/temp.html', $file);
    }
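
Two wrinkles in the loop above are worth hedging against: a fixed /tmp/temp.html can collide between runs, and rename() can fail when /tmp sits on a different filesystem from the source file. Here is a minimal sketch of the same read, modify, rename idea. It assumes the core File::Temp and File::Basename modules, and retitle_file is a made-up helper name, not something from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# retitle_file() is a hypothetical helper name, not from the thread.
# Like the loop above, it assumes <title> and </title> share one line.
sub retitle_file {
    my ($file, $new_title) = @_;

    open(my $in, '<', $file) or die "source file open failed: $!\n";

    # Create the temp file next to the source so rename() stays on one filesystem.
    my ($out, $tmpname) = tempfile('retitleXXXXXX', DIR => dirname($file));

    while (my $line = <$in>) {
        $line =~ s{<title>.*?</title>}{<title>$new_title</title>}i;
        print {$out} $line;
    }
    close($in);
    close($out) or die "tmp file close failed: $!\n";

    # On Unix-like systems rename() replaces the original,
    # so no separate unlink() is needed.
    rename($tmpname, $file) or die "rename failed: $!\n";
    return;
}

# Usage: perl retitle.pl "New title" *.html
if (@ARGV) {
    my $new_title = shift @ARGV;
    retitle_file($_, $new_title) for @ARGV;
}
```

Note that File::Temp creates its files with restrictive permissions, so a chmod after the rename may be needed if the pages are served by a web server.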
Re: Rename html page titles
by davido (Cardinal) on Sep 03, 2003 at 19:30 UTC
    In your original question you mentioned that you were having trouble opening the file for append. That in and of itself is a problem: If you try to append, you'll be tacking on the new title at the end of the file, which isn't what you want. You probably want to be editing inline.
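
A small sketch of what append mode actually does may make this concrete. The scratch file and its contents are made up for the demonstration; the point is only that a handle opened with ">>" cannot be read from, and that anything printed to it lands at the end of the file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file with known contents (name and contents are made up).
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "<title>Old</title>\n";
close($fh);

# Open for append, as in the original script, and try to read from it.
open(my $append, '>>', $path) or die "can't open $path: $!";
my $line;
{
    no warnings 'io';   # silence "Filehandle opened only for output"
    $line = <$append>;
}
print defined $line ? "read: $line" : "nothing could be read\n";

# Anything printed to the handle lands at the end of the file.
print {$append} "<title>New</title>\n";
close($append);
```

So the while loop in the original script never sees a single line, which is why it always reports that no title was found.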

    I was curious to see if this could be done in a one-liner. The following is untested.

    Though one shouldn't rely on regexp's to parse HTML, it's tempting in this narrowly defined situation to resort to a one liner:

    perl -pi -00 -e 's{(<TITLE.*?>).*?(<\/TITLE>)}[$1newtitle$2]gis' *.html

    The premise is that the -p option wraps a

    while(<>) { ..... }

    loop around the program. <> reads files that are listed on the command line (in this case *.html).

    And the -i switch tells the program to edit the files in place.

    The -00 command line switch puts Perl in paragraph mode (the same as setting $/ to ""), which tells the program to read in chunks of the file separated by one or more blank lines. This is somewhat arbitrary, and you could probably just as easily use -0777 to slurp each file whole. The point is to catch situations where the two tags being matched span multiple lines. This method will still get tripped up if there are cases where a blank line (two or more newlines in a row) falls between the tags.

    And the /s modifier on the substitution regexp tells the regexp engine to treat newlines like any other character. The /g modifier on the regexp tells the engine to look for all occurrences. That may not be necessary if you know that the HTML doesn't have more than one set of <title></title> tags. And the /i modifier tells the regexp engine to treat <Title>, <tItLe>, and so on, all the same (ignore case). The .*? inside of the <title> tag allows for whitespace and comments, and other things within the title tag. And the .*? between tags is a non-greedy match of the text between the title tags.

    The shortcomings will be: Any <title>....</title> that contains two or more newlines embedded anywhere between the tags will result in the match failing and the title not being replaced. Nested title tags would also be a problem. And title tags (even if somehow escaped) within the title would also throw things off track. And I'm sure that there are others. But if you want quick, dirty, and if you know that the HTML doesn't have multiple newlines next to each other within the title tags, this may do the trick.
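
The blank-line limitation can be seen in a small script. This is only a sketch: the helper name replace_title and the sample HTML are made up for illustration. It applies the same substitution to records read in paragraph mode ($/ set to "", which is what -00 does) and in slurp mode ($/ undefined, which is what -0777 does).

```perl
use strict;
use warnings;

# replace_title() is a hypothetical helper: it reads $html through an
# in-memory filehandle using the given record separator, applies the
# substitution to each record, and returns the joined result.
sub replace_title {
    my ($html, $separator) = @_;
    local $/ = $separator;
    open(my $fh, '<', \$html) or die "in-memory open failed: $!";
    my $out = '';
    while (my $record = <$fh>) {
        $record =~ s{(<TITLE.*?>).*?(</TITLE>)}{$1newtitle$2}gis;
        $out .= $record;
    }
    close($fh);
    return $out;
}

# A title with a blank line in the middle of it.
my $html = "<html><head><title>Old\n\ntitle</title></head></html>\n";

print "paragraph mode:\n", replace_title($html, ''),    "\n";
print "slurp mode:\n",     replace_title($html, undef), "\n";
```

In paragraph mode the opening and closing tags end up in different records, so the title is left alone. In slurp mode the whole document is one record and the substitution goes through.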

    To illustrate the effectiveness of this technique, I used the following one liner to convert all of the #!/perl/bin/perl "Shebangs" from all Perl scripts in one directory on my Windows PC to #!/usr/bin/perl "Shebangs" so that I could use those scripts on my Linux PC.

    perl -pi -e 's{^#!\/.*?$}[#!\/usr\/bin\/perl];' *.pl

    At first I was thinking that using s{...}[...] instead of s/// would eliminate the need to escape the /, but after trying it without escaping the slash, I found that I still had to do so for it to work. But nevertheless, one quick one-liner converted dozens of files' shebang lines for me. With just a little more creativity I could also have preserved any shebang-line perl flags such as -w (warnings). Though with these particular files that was unnecessary.

    Dave

    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

      Something like this is a bit more robust and writes a backup just in case it destroys stuff ;-) The [^<] class works well for REs in HTML where you want to match potential multilines. There are of course edge cases where < could be valid content between tags. You deal with them with a negative lookahead ([^<]+|<(?!\s*/))* and alternation.

      This sort of thing is a quick and dirty solution, the parsing modules or templating are better.

      perl -pi.bak -e 's#<\s*title\s*>([^<]+|<(?!\s*/))*<\s*/\s*title\s*>#<title>New Title</title>#i' <files>

      # this will correctly parse horrid stuff like
      <TiTle > One is < two more stuff </ title> <foo>bar</foo>
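
For anyone who wants to try the pattern outside a one-liner, here is a minimal sketch that applies it to a string holding a title spread over several lines. The helper name set_title and the sample input are made up for illustration, and the capturing group has been made non-capturing since its contents are not used.

```perl
use strict;
use warnings;

# Same pattern as the one-liner above, written with /x for readability
# and with a non-capturing group.
my $title_re = qr{
    <\s*title\s*>              # opening tag, tolerating stray whitespace
    (?: [^<]+ | <(?!\s*/) )*   # text, or a '<' that does not start a closing tag
    <\s*/\s*title\s*>          # closing tag
}xi;

# set_title() is a hypothetical helper name used only for this sketch.
sub set_title {
    my ($html, $new_title) = @_;
    $html =~ s/$title_re/<title>$new_title<\/title>/;
    return $html;
}

my $horrid = "<TiTle >\nOne is < two\nmore stuff\n</ title>\n<foo>bar</foo>\n";
print set_title($horrid, 'New Title');
```

The pattern can only match across lines if the whole title is in one record. With plain -p each line is a separate record, so on the command line a multi-line title would probably also need -0777 to slurp each file whole.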

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Rename html page titles
by Popcorn Dave (Abbot) on Sep 03, 2003 at 19:18 UTC
    Your immediate problem is that you're never writing anything to the file. You're going to need to do a

    print HTFILE <changed title here>

    to change the title. Even if you want to change it to nothing, you still need to write to the file.

    That said, you're probably going to need to open a new file to write the entire output to once you've made your changes.

    You could do it by modifying $/ and slurping the entire file in, doing a quick s/// on the string and then writing the result to a file, but I'd stick with the other way if there aren't too many files to change.
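
A minimal sketch of the slurp approach might look like the following. The helper name slurp_and_retitle is made up, and writing the result straight back over the original file is an assumption made for brevity, so keep backups: a failure halfway through the write would lose the file.

```perl
use strict;
use warnings;

# slurp_and_retitle() is a hypothetical helper name for this sketch.
sub slurp_and_retitle {
    my ($file, $new_title) = @_;

    # Read the whole file as one string by undefining $/ locally.
    open(my $in, '<', $file) or die "can't open $file: $!";
    my $html = do { local $/; <$in> };
    close($in);

    # /s lets . match newlines, /i ignores the case of the tags.
    my $changed = ($html =~ s{<title>.*?</title>}{<title>$new_title</title>}is);

    # Write the result back only if a title was found.
    if ($changed) {
        open(my $out, '>', $file) or die "can't write $file: $!";
        print {$out} $html;
        close($out) or die "can't close $file: $!";
    }
    return $changed ? 1 : 0;
}

# Usage: perl slurp_retitle.pl "New title" *.html
if (@ARGV) {
    my $new_title = shift @ARGV;
    for my $file (@ARGV) {
        slurp_and_retitle($file, $new_title)
            or print "*** No title found *** ($file)\n";
    }
}
```

Because the whole file is one string, a title that spans several lines is handled the same way as one that fits on a single line.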

    Hope that helps!

    There is no emoticon for what I'm feeling now.