What about compressing the document with something like Huffman coding? Applied to words rather than characters, it replaces each word with a key, and the most frequently repeated words get the shortest keys, so that would really compress the text. You could then compare the keys each document uses as replacements as a way of comparing similar text. Going further (though this might be pushing it), you could store just the keys it uses, i.e. the header from the compression. Since the keys are assigned based on frequency of use, you could also eliminate all of the short words at that point.
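Here is a rough sketch of what I mean, not a tested solution. It builds a word-level Huffman table and then compares two documents by the words in their tables. The names `huffman_table`, `key_overlap`, and `min_len` are made up for illustration, and splitting on whitespace and using Jaccard overlap are my own assumptions. Note that the bit strings themselves differ from document to document, so the sketch compares which words ended up in each table rather than the codes.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_table(text):
    """Map each word to a Huffman bit string; frequent words get shorter codes."""
    # Assumption: lowercase and split on whitespace as a crude tokenizer.
    freq = Counter(text.lower().split())
    if len(freq) < 2:
        # Zero or one distinct word: no tree to build, use a one-bit code.
        return {word: "0" for word in freq}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(n, next(tie), {word: ""}) for word, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def key_overlap(table_a, table_b, min_len=4):
    """Jaccard overlap of the words in two tables, skipping short words."""
    # Assumption: "short" means fewer than min_len characters.
    a = {w for w in table_a if len(w) >= min_len}
    b = {w for w in table_b if len(w) >= min_len}
    return len(a & b) / len(a | b) if a | b else 0.0
```

You would call `huffman_table` once per document and keep only the returned table, then run `key_overlap` on pairs of tables. The `min_len` cutoff is the "eliminate the short words" step; filtering by frequency instead (dropping the words with the shortest codes) would be another way to do it.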
Just a thought.. =)