in reply to Fingerprinting text documents for approximate comparison
The techniques of text mining sound applicable to your quandary. When I read your question, I immediately thought of an article I'd read:
Marc Damashek. Gauging Similarity with N-Grams: Language-Independent Categorization of Text. Science, Vol. 267, pp. 843-848, 10 February 1995.
Similar articles include:
Roy E. Kimbrell. "Searching for Text? Send an N-Gram!" Byte, Vol. 13, No. 5, p. 297(9), May 1988. (Includes related articles on implementing n-gram systems and n-gram vectors in C.)
Abstract: N-gram indexing systems are the best method of retrieving information from large full-text databases. An n-gram is a sequence of a specified number of characters occurring in a word. N-gram vectors must be derived for each document stored in order to set up a document-retrieval system using n-grams. N-gram indexing is computationally less intensive than keyword solutions, the next best alternative. N-gram systems are adaptable to several different situations, and systems do not need to be re-indexed to answer completely new questions. N-grams are limited in that they are complicated, memory- and processor-intensive, and not exact.
and:
"One Size Fits All? A Simple Technique to Perform Several NLP Tasks." by Daniel Gayo-Avello, Darío Álvarez-Gutiérrez, and José Gayo-Avello.
There are several Perl packages for working with N-Grams; you can search CPAN for them.
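To make the idea concrete, here is a minimal sketch (in Python rather than Perl, purely for illustration) of the technique Damashek's article describes: slide a window over the text to build a frequency vector of character n-grams, then compare documents by the cosine of the angle between their vectors. The CPAN modules do essentially the same bookkeeping.

```python
from collections import Counter
from math import sqrt

def ngram_vector(text, n=3):
    """Frequency vector of character n-grams from normalized text."""
    text = " ".join(text.lower().split())  # collapse whitespace, fold case
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse n-gram vectors (1.0 = identical)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

v1 = ngram_vector("The quick brown fox jumps over the lazy dog.")
v2 = ngram_vector("The quick brown fox jumped over a lazy dog.")
print(cosine_similarity(v1, v2))  # near-duplicates score close to 1.0
```

Because the vectors are built from raw character sequences rather than words, the same comparison works across languages and tolerates typos, which is what makes the approach attractive for approximate fingerprinting.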
I realize this is only a possible pointer in the right direction, but hope it helps.
Nancy
Addendum: This might be the best answer to your question: http://www.perlmonks.org/index.pl?node_id=32285