The techniques of Text Mining sound applicable to your quandary. When I read your question, I thought immediately of an article I'd read:

Marc Damashek. Gauging Similarity with N-Grams: Language-Independent Categorization of Text. Science, Vol. 267, pp. 843-848, 10 February 1995.

Similar articles include:

Roy E. Kimbrell. Searching for Text? Send an N-Gram! Byte, Vol. 13, No. 5, p. 297 (9 pages), May 1988. (Technical; includes related articles on implementing n-gram systems and n-gram vectors in C.)

Abstract: N-gram indexing systems are the best method of retrieving information from large full-text databases. An n-gram is a sequence of a specified number of characters occurring in a word. N-gram vectors must be derived for each document stored in order to set up a document-retrieval system using n-grams. N-gram indexing is computationally less intensive than keyword solutions, the next best alternative. N-gram systems are adaptable to several different situations and systems do not need to be re-indexed to answer completely new questions. N-grams are limited in that they are complicated, is memory- and processor-intensive, and is not exact {sic}.

and:

"One Size Fits All? A Simple Technique to Perform Several NLP Tasks." by Daniel Gayo-Avello, Darío Álvarez-Gutiérrez, and José Gayo-Avello.

There are several Perl packages for working with N-Grams; you can search CPAN for them.
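
To give a feel for the idea before you reach for a module, here is a minimal pure-Perl sketch of comparing two texts by their character n-gram profiles. It is only my illustration, not the method from any of the articles above: the names `ngram_profile` and `cosine_similarity`, the default of n = 3, and the normalization step (lowercasing, collapsing non-alphanumerics to a single space) are all assumptions of mine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a character n-gram frequency profile (hash ref: n-gram => count).
# Assumption: text is lowercased and every run of characters outside
# a-z / 0-9 is collapsed to one space before the n-grams are taken.
sub ngram_profile {
    my ($text, $n) = @_;
    $n ||= 3;
    my $clean = lc $text;
    $clean =~ s/[^a-z0-9]+/ /g;
    $clean =~ s/^ | $//g;          # trim one leading/trailing space
    my %count;
    for my $i (0 .. length($clean) - $n) {
        $count{ substr($clean, $i, $n) }++;
    }
    return \%count;
}

# Cosine similarity of two profiles: 1 for identical distributions,
# 0 when the profiles share no n-grams (or either one is empty).
sub cosine_similarity {
    my ($p, $q) = @_;
    my ($dot, $pp, $qq) = (0, 0, 0);
    for my $gram (keys %$p) {
        $pp  += $p->{$gram} ** 2;
        $dot += $p->{$gram} * $q->{$gram} if exists $q->{$gram};
    }
    $qq += $_ ** 2 for values %$q;
    return 0 unless $pp && $qq;
    return $dot / sqrt($pp * $qq);
}

# Hypothetical sample documents, just to exercise the two subs.
my $doc1 = ngram_profile('The quick brown fox jumps over the lazy dog.');
my $doc2 = ngram_profile('A quick brown fox jumped over a lazy dog!');
my $doc3 = ngram_profile('Perl packages for n-grams live on CPAN.');

printf "doc1 vs doc2: %.3f\n", cosine_similarity($doc1, $doc2);
printf "doc1 vs doc3: %.3f\n", cosine_similarity($doc1, $doc3);
```

For approximate comparison you would pick a threshold on the similarity score by experiment. For anything large, one of the CPAN modules will likely serve you better than hand-rolled hashes.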

I realize this is only a possible pointer in the right direction, but hope it helps.

Nancy

Addendum: This might be the best answer to your question: http://www.perlmonks.org/index.pl?node_id=32285


In reply to Re: Fingerprinting text documents for approximate comparison by planetscape
in thread Fingerprinting text documents for approximate comparison by Mur
