Re: Mangling HTML to protect content, and finding stolen HTML content

Measuring how often keywords occur, and how far apart they sit in particular contexts, is widely used in plagiarism detection. That may be the way forward for you, coupled with some fuzzy word matching to catch appropriated keywords or stolen factual information.
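To make that a bit more concrete, here is a minimal sketch in Python using only the standard library. It is my own illustration, not the method used by any particular plagiarism tool. The function names (`tokenize`, `canonical`, `keyword_profile`), the 0.8 similarity threshold, and the use of `difflib.SequenceMatcher` for the fuzzy matching are all assumptions you would want to tune for your own content.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def canonical(token, keywords, threshold=0.8):
    """Map a token to the keyword it matches, exactly or fuzzily.

    Returns None when no keyword is similar enough. The threshold is an
    arbitrary starting point (assumption), not a recommended value.
    """
    if token in keywords:
        return token
    best, best_ratio = None, 0.0
    for kw in keywords:
        ratio = SequenceMatcher(None, token, kw).ratio()
        if ratio > best_ratio:
            best, best_ratio = kw, ratio
    return best if best_ratio >= threshold else None


def keyword_profile(text, keywords, threshold=0.8):
    """Return (keyword frequencies, mean gap in tokens between keyword hits).

    The mean gap is None when fewer than two keyword hits are found.
    """
    hits = []  # (token position, matched keyword)
    for pos, tok in enumerate(tokenize(text)):
        kw = canonical(tok, keywords, threshold)
        if kw is not None:
            hits.append((pos, kw))
    freq = Counter(kw for _, kw in hits)
    gaps = [b[0] - a[0] for a, b in zip(hits, hits[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else None
    return freq, mean_gap
```

You would run `keyword_profile` over your own page and over a suspect page with the same keyword set, then compare the two profiles. Similar frequencies and similar gaps are only a hint that the pages deserve a closer look by hand, not proof of copying.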

You may wish to look at plagiarism.org, a paper at Georgetown on concordances used for text comparison, and Christian Queinnec's plagiarism detection script, plagiat.
