Remove everything up to and including the opening BODY tag. Remove all tags, replacing images with their alt text. Then compare the start and end of each page with the start and end of every other page. Remove any material that is common to at least N pages and more than M words long (or some combination of the two thresholds). This will be the header and footer material.
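Here is a rough Python sketch of those steps, in case it helps. It is only one way to do it: the names (`BodyText`, `page_words`, `strip_boilerplate`) and the two thresholds (`min_pages`, `min_words`) are made up for illustration, it compares whole words rather than characters, and it also skips SCRIPT and STYLE content, which is an extra assumption on my part.

```python
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect the words inside <body>, using alt text in place of images."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip = 0          # depth inside <script>/<style> (assumption)
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip += 1
        elif tag == "img" and self.in_body:
            alt = dict(attrs).get("alt") or ""
            self.words.extend(alt.split())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip:
            self.words.extend(data.split())


def page_words(html):
    """Drop everything before <body>, strip tags, return a list of words."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    return parser.words


def common_prefix_len(a, b):
    """Number of leading words that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def shared_len(words, others, min_pages, min_words):
    """Length of the prefix of `words` shared with enough other pages."""
    needed = max(min_pages, 2) - 1      # other pages that must agree
    lens = sorted((common_prefix_len(words, o) for o in others), reverse=True)
    if len(lens) < needed:
        return 0
    n = lens[needed - 1]
    return n if n > min_words else 0    # "more than" min_words words


def strip_boilerplate(pages, min_pages=2, min_words=3):
    """pages: list of word lists. Returns each page minus header and footer."""
    result = []
    for i, words in enumerate(pages):
        others = [p for j, p in enumerate(pages) if j != i]
        head = shared_len(words, others, min_pages, min_words)
        # A footer is just a shared prefix of the reversed pages.
        tail = shared_len(words[::-1], [p[::-1] for p in others],
                          min_pages, min_words)
        tail = min(tail, len(words) - head)   # don't let the two overlap
        result.append(words[head:len(words) - tail])
    return result
```

You would call `page_words` on each page's HTML, then pass the resulting list of word lists to `strip_boilerplate`. The thresholds will need tuning for the site in question.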
What's left is the classic "longest substrings common to two pieces of text" problem. There was a discussion of that recently - let me see if I can find the thread...
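In the meantime, one standard textbook approach is dynamic programming over the two word lists. This is a minimal sketch, not necessarily what that thread settled on; the function name `longest_common_run` is my own, and it works on words rather than characters so that it lines up with the tag-stripped pages above.

```python
def longest_common_run(a, b):
    """Return the longest contiguous run of words found in both a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)   # run lengths ending at the previous word of a
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended one word earlier in both lists.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]
```

This takes time proportional to len(a) * len(b), which is fine for a handful of pages but slow for a large site; a suffix-based structure would scale better.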