A simple program to help you systematically identify what changed between the last version of a document and this one. It is specifically geared toward picking up subtle changes such as table cell widths.
I was thinking of using more elaborate means to diff a couple of HTML documents, but this serves my needs when they (our friends the HTML designers) have just been fiddling and the two docs are basically the same.
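
To make the idea concrete before the full script, here is a minimal sketch of the comparison it performs, run on two inline strings instead of files. The two HTML snippets and the helper name `tags_named` are made up for illustration; the tag pattern and the case-insensitive compare mirror what the script below does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two hypothetical snippets standing in for the old and new documents.
my $old = '<table width="600"><tr><td width="200">a</td><td WIDTH="400">b</td></tr></table>';
my $new = '<table width="600"><tr><td width="180">a</td><td width="400">b</td></tr></table>';

# Hypothetical helper: grab every tag with the same crude pattern the
# script uses, then keep only the tags with the requested name.
sub tags_named {
    my ($html, $tag) = @_;
    my @all = $html =~ /(<[^>]+>)/g;
    return grep { /^<$tag(\s|>)/i } @all;
}

my @old_td = tags_named($old, 'td');
my @new_td = tags_named($new, 'td');

# Compare position by position, ignoring case as the script does.
my @report;
for my $i (0 .. $#old_td) {
    my $same = lc($old_td[$i]) eq lc($new_td[$i]);
    push @report, ($same ? 'match ' : 'differ') . ": $old_td[$i] | $new_td[$i]";
}
print "$_\n" for @report;
```

The first `<td>` pair is reported as different because its width changed, while the second pair counts as a match even though the attribute name differs in case.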

#!/usr/bin/perl
# -w not used because of a few noisy warnings in write's

# tag_comp.pl
#  - jlawrenc@infonium.com - use at your own risk
#
# A quick 'n dirty to help you compare HTML tags across two similar documents.
#
# This happens to me from time to time. We have an HTML template that has been
# adapted for server-side use. Then the graphic designer goes off and reformats
# with different fonts, tag sizes or whatever. It could be easier to scope out the
# changes and then just re-edit our template document rather than reworking the
# supplied HTML back into a template.
#
# Invoke thusly:
#  tag_comp fn1 fn2 [tag [shift]]
#
# e.g.
#  tag_comp index.html new_index.html table
#    generates a report of how the <table> tag is used differently between the two
#    documents
#
#  tag_comp index.html new_index.html img 2
#    a report of how <img> tags have changed, shifting the left col down a couple
#    of rows to help line up the differences
#
#
#  Things to consider
#   a - tag regex is real simple "<" + not > 1 or more times + ">"
#       this may not always work for you
#   b - tag compares are lowercased
#
#  It would be nice to try and line up the matches more effectively but a human
#  will do the job for now.


# Report header
format STDOUT_TOP =
---------------------------------------------------------------------------------
@|||||||||||||||||||||||||||||||||||||| | @||||||||||||||||||||||||||||||||||||||
$fn1, $fn2
---------------------------------------------------------------------------------
.

# Report body - lines that do not match
format STDOUT =
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~~ | ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~~
$srch1[$i], $srch2[$i]
---------------------------------------------------------------------------------
.

# Report body - lines that do match
format STDOUT_MATCH =
 * match: ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~~
$srch1[$i]
---------------------------------------------------------------------------------
.


# Our input arguments - file name1, file name2, tag to report on, shift value
($fn1, $fn2, $tag, $shift) = @ARGV;
if (!$fn1 or !$fn2) {
  die "Please supply two file names to compare.";
}

# Default to "img" tags
if (!$tag) {
  $tag="img";
  print STDERR "Defaulting to search for <$tag>s\n\n";
}

# Check for positive shift
if ($shift<0) {
  print STDERR "shift only works with positive vals.\n";
  print STDERR "if you want to shift the other way then try reversing your file names. :)\n";
}

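# Slurp our files - bail out if either one cannot be opened
undef $/;
open(FIN, '<', $fn1) or die "Cannot open $fn1: $!\n";
$file1=<FIN>;
close FIN;
open(FIN, '<', $fn2) or die "Cannot open $fn2: $!\n";
$file2=<FIN>;
close FIN;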

# Grab our tags - real crude regex that may not always do the trick
while ($file1 =~ /(<[^>]+>)/gms) {
  push @tags1, $1;
}

while ($file2 =~ /(<[^>]+>)/gms) {
  push @tags2, $1;
}

# Get our list of matching tags
@srch1=grep /^<$tag(\s|>)/i, @tags1;
@srch2=grep /^<$tag(\s|>)/i, @tags2;

# Shift first search result if needed
for ($i=0; $i<$shift; $i++) { unshift @srch1, ""; }

# Find out who has more rows - set1 or 2
$rows=$#srch1 > $#srch2 ? $#srch1 : $#srch2;

# Write our header
$~="STDOUT_TOP";
write;

# Write report body
foreach ($i=0; $i<=$rows; $i++) {

  # One format for rows that are the same, another for those that are not
  if (lc $srch1[$i] ne lc $srch2[$i]) {
    $~="STDOUT";
    write;
  } else {
    $~="STDOUT_MATCH";
    write;
  }

}

# Done - coffee time
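
The header comment wishes the matches could be lined up more effectively than with a manual shift. One possible approach, offered only as a sketch and not as part of the script above, is to align the two tag lists on their longest common subsequence so that inserted or removed tags show up as gaps. The function name `align_tags` and the sample tag lists are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: align two lists of tags on their longest common
# subsequence (LCS), comparing case-insensitively. Returns a list of
# [left, right] pairs; an empty string marks a gap on that side.
sub align_tags {
    my ($left, $right) = @_;
    my @a = map { lc($_) } @$left;
    my @b = map { lc($_) } @$right;

    # $len[$i][$j] holds the LCS length of @a[$i..$#a] and @b[$j..$#b].
    my @len;
    for my $i (reverse 0 .. @a) {
        for my $j (reverse 0 .. @b) {
            if ($i == @a || $j == @b) {
                $len[$i][$j] = 0;
            } elsif ($a[$i] eq $b[$j]) {
                $len[$i][$j] = 1 + $len[$i + 1][$j + 1];
            } else {
                my ($down, $across) = ($len[$i + 1][$j], $len[$i][$j + 1]);
                $len[$i][$j] = $down >= $across ? $down : $across;
            }
        }
    }

    # Walk the table from the start, emitting matched pairs and gaps.
    my ($i, $j) = (0, 0);
    my @rows;
    while ($i < @a && $j < @b) {
        if ($a[$i] eq $b[$j]) {
            push @rows, [ $left->[$i++], $right->[$j++] ];
        } elsif ($len[$i + 1][$j] >= $len[$i][$j + 1]) {
            push @rows, [ $left->[$i++], '' ];
        } else {
            push @rows, [ '', $right->[$j++] ];
        }
    }
    push @rows, [ $left->[$i++], '' ] while $i < @a;
    push @rows, [ '', $right->[$j++] ] while $j < @b;
    return @rows;
}

# Made-up sample data: the new document gained a logo image at the top.
my @old = ('<img src="a.gif">', '<img src="b.gif" width="10">', '<img src="c.gif">');
my @new = ('<img src="logo.gif">', '<img src="a.gif">',
           '<img src="b.gif" width="12">', '<IMG SRC="c.gif">');

my @rows = align_tags(\@old, \@new);
printf "%-30s | %s\n", @$_ for @rows;
```

With this sample data the unchanged tags pair up without any shift argument. A tag whose attributes changed is not paired with its counterpart. It appears as a left-only row followed by a right-only row, so a human still has to read those two rows together.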
In reply to HTML tag compares between similar files by jlawrenc
