Bulk HTML Munging

Category:	HTML Utility
Author/Contact Info	Ovid
Description:	At my last job, we had a problem whereby many static HTML documents needed their footer replaced with a server side include, but the documents had been coded by hand and the HTML in the footers was very irregular. Because there were hundreds of documents, I wrote the following tool to allow for bulk matching and replacing of messy HTML. It's more powerful than you might think, so please read the POD for more information. The actual code is a wee bit sloppy as I had to get this written quickly. This is real, live production code from yours truly :) Many thanks to ONSITE! Technology, Inc. for giving me permission to release this as open source. (Trivia note: this program inspired $bad_names eq $bad_design)
#!/usr/local/bin/perl use warnings; use strict; use Cwd; use File::Copy; use Getopt::Long; use HTML::TokeParser::Simple 1.4; use Pod::Usage; my $VERBOSE = 1; GetOptions( 'help\|?' => sub { pod2usage(-verbose => 2);exit } , 'config=s' => \my $CONFIG, 'backup=s' => \my $BACKUP, 'verbose!' => \$VERBOSE, 'quiet' => sub { $VERBOSE = 0 }, 'debug' => \my $DEBUGGING, 'ordered' => \my $ORDERED_ATTRIBUTES, 'ignore' => \my $IGNORE_ATTRIBUTES, 'text' => \my $MATCH_TEXT, 'fuzzy' => \my $FUZZY_MATCH ); backup_dir( $BACKUP ); $BACKUP .= '/' unless substr( $BACKUP, -1 ) =~ /\//; my %CONFIG = read_config( $CONFIG ); if ( $DEBUGGING ) { no warnings 'once'; require Data::Dumper; print Data::Dumper->Dump([\%CONFIG], ['CONFIG']); } my %REPLACEMENTS; @ARGV = map { glob $_ } @ARGV; foreach my $file (@ARGV) { print "Processing ($file)...\n" if $VERBOSE; backup_file( $file, $BACKUP ); $REPLACEMENTS{$file} = 0; open HTML, '+<', $file or die "Can't open ($file) for updating: $!"; my $html = parse_document( HTML,$file ); print "\t$REPLACEMENTS{$file} replacement(s) made to ($file)\n" if $ +VERBOSE; if ( $REPLACEMENTS{$file} ) { seek HTML, 0, 0 or die "Can't seek to start of ($file): + $!"; print HTML $html; truncate HTML, tell(HTML) or die "Can't truncate ($file): $!"; } if ( $DEBUGGING ) { print $html; } close HTML; } sub parse_document { my ($fh,$file) = @_; my $parser = HTML::TokeParser::Simple->new( $fh ); my $html = ''; while ( my $token = $parser->get_token ) { if( stacks_match( $parser, $CONFIG{stack} ) ) { $html .= $CONFIG{new}; $REPLACEMENTS{$file}++; } else { $html .= $token->as_is; } } return $html; } sub stacks_match { # if the stack matches the current token stream, return true and lea +ve the # parser at the end of the stream match. If it doesn't match, set t +he # parser to its original state and return false. my ($parser,$stack) = @_; my $stacks_match = 1; my @current_stack; for my $i ( 0 .. $#$stack ) { my $token = $parser->get_token; unless ($token) { # we've reached the end of the document and thus cannot match $parser->unget_token(@current_stack); return; } push @current_stack => $token; $stacks_match = tokens_match($token,$stack->[$i]); unless ($stacks_match) { # stacks didn't match. Restore state and return if ( $DEBUGGING ) { print "\n* Current stack match failed:\n\n"; print Data::Dumper::Dumper(\@current_stack), "\n"; } $parser->unget_token(@current_stack); return; } } if ( $DEBUGGING ) { print "\n* Matched this stack against config stack:\n\n"; print Data::Dumper::Dumper(\@current_stack), "\n"; } return 1; } sub munge_text { my $text = shift; $text =~ s/\W//g; return lc $text; } sub tokens_match { my ($token,$stack_token) = @_; if ( $token->[0] ne $stack_token->[0] ) { # token types did not match return; } elsif ($token->is_tag) { return token_as_string($token) eq token_as_string($stack_token); } elsif ($MATCH_TEXT) { my $curr_text = $token->return_text; my $stack_text = $stack_token->return_text; if ($FUZZY_MATCH) { $curr_text = munge_text( $curr_text ); $stack_text = munge_text( $stack_text ); } if ( $curr_text =~ /\S/ or $stack_text =~ /\S/ ) { return $curr_text eq $stack_text; } } else { # we're ignoring whatever it is, so it's an automatic match return 1; } } sub read_config { my $file = shift; my %allowed = map {$_=>1} qw(old new); open CONFIG, "<", $file or die "Cannot read ($file): $!"; my %config; local $_; my ($section,$old_section) = ('',''); while (<CONFIG>) { next unless /\S/; if ( /^\s\[([^\]]+)\]\s$/ ) { # [$section] $section = $1; die "Unknown section ($section) in config file ($file)" unless exists $allowed{$section}; $config{$section} = ''; next; } $config{$section} .= $_; } chomp foreach values %config; close CONFIG; return add_stack( %config ); } sub add_stack { my %config = @_; my $html = $config{old}; $config{stack} = []; my $parser = HTML::TokeParser::Simple->new( \$html ); while ( my $token = $parser->get_token ) { push @{$config{stack}} => $token; } delete $config{old}; return %config; } sub token_as_string { # so far, this is fairly simple. It merely "stringifies" the tag ty +pe # and attributes. This may change in the future. my $token = shift; my $sequence = $token->return_attrseq; return canonical_tag($token) unless $sequence && ! $IGNORE_ATTRIBUTE +S; @$sequence = sort @$sequence unless $ORDERED_ATTRIBUTES; my $attributes = $token->return_attr; my $results = ''; foreach my $attr (@$sequence) { $results .= $attr . $attributes->{$attr}; } return canonical_tag($token).$results; } sub backup_dir { my $dir = shift \|\| die pod2usage(); unless ( -d $dir ) { mkdir $dir or die "Could not makedir ($dir): $!"; } } sub canonical_tag { # prepends a backslash my $token = shift; my $tag = $token->return_tag; $tag = "/$tag" if $token->is_end_tag and '/' ne substr $tag, 0, + 1; return $tag; } sub backup_file { my ($file,$backup) = @_; if ( -e $file ) { copy( $file, "$backup$file" ) or die "Could not copy ($file) to ($backup): $!";\ return 1; } else { warn "\tWARNING: File ($file) does not exist in (",cwd,")\n"; return; } } __END__ =head1 NAME htmlreplace -- A simple HTML replacement tool =head1 SYNOPSIS B<htmlreplace --help> for more information htmlreplace [options] [filenames] Options: --help Display POD --? Same as --help --verbose List files while processing them (default) Will also list number of substitutions made. --noverbose Turn off --verbose --quiet Same as --noverbose --config file Location of config file --backup dir The directory to back up the files to --text Match text (default is off -- only check struct +ure) --fuzzy Same as --text, but matching is more robust (se +e below) --ignore Use this to ignore attributes --ordered If specified, attributes must appear in the sam +e order in both the config html and the target ht +ml. Default is unordered. --debug This will dump the config token stack to STDOUT +. =head1 OVERVIEW This program allows a the user to create a simple configuration file t +hat will define HTML snippets and the replacement text for them. Then, a list +of file names will be iterated over, checking the HTML and if any correspondin +g HTML is found, will replace the HTML as specified in the config file =head1 DESCRIPTION =head2 Configuration File The config file takes two tokes, which should be on lines by themselve +s, C<[old]> and C<[new]>. After the C<[new]> token, add the text that yo +u wish to replace the HTML with. After the C<[old]> token, add the HTML that th +e program must search for and replace. An example configuration file named I<copyright.cfg>: [old] <!--#include virtual="includes/copyright" --> [new] <table> <tr> <td class="copyright">© 2002 by Some Company, INC.</td> </tr> </table> This program parses the HTML into tokens, so whitespace is not importa +nt with tags. To use the above config file with a backup directory named I<old>: htmlreplace --config copyright.cfg --backup old .html Command line options may also be shortened to the smallest number of l +etters necessary to distinguish them from other options. Thus, the above can + be written as follows: htmlreplace -c copyright.cfg -b old .html =head2 Attribute handling Attributes are the name/value pairs associated with HTML start tags. +For example, the following tag has an attribute name of I<class> with a va +lue of I<foobar>. <p class="foobar"> Many tags will have multiple attributes. <img src="check.gif" height="13" width="12" alt="Check Gif"> By default, attributes of start tags do not need to appear in the same + order in the sample HTML and the HTML being examined. The following two C<i +mg> tags are equivalent: <img src="foo" height="13" width="13" alt="foo"> <img height="13" width="13" alt="foo" src="foo"> If attribute order is important, you may use the I<--ordered> switch ( +I<-o>). htmlreplace --ordered --config copyright.cgi --backup old .html If you wish to ignore attributes and simply ensure that the order of t +he tags is correct, use the I<--ignore> (I<-i>) switch. htmlreplace --ordered --ignore --config copyright.cgi --backup old . +html =head2 Matching Text By default, C<htmlreplace> only matches the structure of the document, + not the text. To match text, use the I<--text> (I<-t>) switch. This will ens +ure an exact match of the text (but skips any text that is pure whitespace). If the text might be a little off, such as unusual capitalization, ext +ra white space, etc., you can use the I<--fuzzy> (I<-f>) switch instead. This +will match text if the source and target texts match after all "non-word" (C</\W/>) characters are removed and all letters have been lower-cased +. =head1 COPYRIGHT Copyright (c) 2001 Curtis "Ovid" Poe. All rights reserved. This program is free software; you may redistribute it and/or modify i +t under the same terms as Perl itself.

Comment on Bulk HTML Munging Download Code

Replies are listed 'Best First'.
Re: Bulk HTML Munging by chanio (Priest) on Aug 14, 2003 at 03:57 UTC
Thank you 4 this very nice brainstorming + good lesson of good code 4 us, the newbies! The style resembles very much to *NIX's , also!	[reply]
Re: Re: Bulk HTML Munging by Ovid (Cardinal) on Aug 14, 2003 at 05:53 UTC
Thank you. However, if you're looking for great code, be warned that the above code used a lot of globals (well, lexicals whose scope covers the entire file). If later on I wanted to reword this into a proper module, those might cause me problems. Fortunately, they're always "read only" in the code and that minimizes things but global variables are still a good thing to avoid, if feasible. Also, my `read_config` was actually put in there so a project manager could easily use this code. In reality, there are better ways of getting the "old" and "new" data into the program. I really should turn this into an OO module and make those values that you can set in methods. Cheers, Ovid New address of my CGI Course.	[reply]