in reply to Re^4: Cluster a big bunch of strings
in thread Cluster a big bunch of strings

citromatik,
Ok, this now sounds more feasible. I would take a multi-phase approach with multiple passes. I use "phase" to mean logical unit of work and I use "pass" to mean processing the same title another time.

Pass 1

Phase 1: Text Normalization

This phase is designed to standardize the text to allow exact matching. You will need to determine which steps are appropriate for your articles, but here are some ideas:

* Lowercase everything and strip punctuation
* Correct common typos
* Remove stopwords
* Sort the remaining tokens so word order no longer matters
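A quick sketch of such a normalization step (Python for brevity; the stopword list and typo table below are illustrative assumptions you would tune for your own corpus):

```python
import re

# Illustrative stopword list and typo corrections; tune these for your articles.
STOPWORDS = {"a", "an", "the", "of", "and"}
TYPO_FIXES = {"teh": "the", "recieve": "receive"}

def normalize(title):
    """Lowercase, strip punctuation, fix common typos, drop stopwords,
    and sort tokens so word order no longer matters for exact matching."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    tokens = [TYPO_FIXES.get(t, t) for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(sorted(tokens))
```

Two titles that differ only in case, punctuation, word order, or a known typo now normalize to the same string and can be matched exactly.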

Phase 2: Tag Titles With Meta Data

This phase is designed to record information about the title that will help you to optimize comparisons in the next pass. Again, you will need to determine what exact data will assist you, but here are some ideas:

* Token count
* Character length
* Soundex/Metaphone code for each token
* Required keyword(s)
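A sketch of the tagging step (again Python for illustration; the `soundex` here is the classic American Soundex, and the metadata field names are my own):

```python
def soundex(word):
    """Classic American Soundex code, e.g. 'Robert' -> 'R163'."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4", "m": "5", "n": "5", "r": "6"}
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":          # h/w do not break a run of equal codes
            prev = code
    return (out + "000")[:4]

def tag_title(normalized):
    """Record per-title metadata used to bucket comparisons in pass 2."""
    tokens = normalized.split()
    return {
        "count": len(tokens),                     # token count
        "length": len(normalized),                # character length
        "soundex": [soundex(t) for t in tokens],  # phonetic code per token
    }
```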

A note on the last item - required keywords: This is more useful for finding related articles than exact matches. Perhaps you can use it to your advantage if you ensure every normalized token matches a known token. The idea is to identify which words are absolutely necessary for two titles to be related. This won't help you if the keyword itself is misspelled and you didn't correct it.

Pass 2

Phase 1: Identify Unmatched Titles

Any title that has a count of 1 after the first pass is a candidate for being an unmatched article. It may in fact be unique, but you want to verify. Hopefully the first pass reduced your list of candidates considerably.
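Assuming pass 1 stored each title's normalized form, picking out the candidates is a one-liner (names here are illustrative):

```python
from collections import Counter

def unmatched_candidates(normalized_titles):
    """Titles whose normalized form occurred exactly once after pass 1."""
    counts = Counter(normalized_titles)
    return [t for t, n in counts.items() if n == 1]
```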

Phase 2: Break Your Work Into Groups

Each unmatched title from the previous phase will need to be compared against other titles, but they needn't be compared against all titles. Use the count and length metadata from pass1/phase2, as well as keyword(s) if you decide to use them. It is unlikely a title containing 15 tokens after text normalization is the same as a title with only 9.
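One simple grouping strategy is to bucket candidates by token count (a sketch under the assumption that `meta` maps each title to the metadata dict from pass1/phase2):

```python
from collections import defaultdict

def bucket_by_count(unmatched, meta):
    """Group candidate titles by token count so a 15-token title is never
    compared against a 9-token one. `meta` maps title -> metadata dict."""
    buckets = defaultdict(list)
    for title in unmatched:
        buckets[meta[title]["count"]].append(title)
    return buckets
```

You could tighten the buckets further with length bands or required keywords, or loosen them by also comparing against adjacent counts.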

What you have now is a group of N unmatched titles that have similar attributes (count, size, keywords). These titles will all need to be compared to each other, which requires (N^2 - N) / 2 comparisons. Additionally, each unmatched title will need to be compared against each of the M matched titles, adding M * N more comparisons. The important thing to realize is that you are not matching matched titles against matched titles.
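The counts are easy to sanity-check numerically; for example, N = 100 unmatched titles in a group cost (100^2 - 100) / 2 = 4950 pairings, plus 100 * M against the matched set:

```python
def comparisons(n_unmatched, n_matched):
    """Pairwise comparisons within the unmatched group plus
    unmatched-vs-matched comparisons."""
    within = (n_unmatched ** 2 - n_unmatched) // 2
    across = n_unmatched * n_matched
    return within + across
```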

Phase 3: Compare

Here you should first use the soundex/metaphone data as an exact match before falling back to an edit distance match (Text::LevenshteinXS for instance). Anything unmatched at this point should be considered a truly unique title.
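Text::LevenshteinXS gives you the edit distance in C-backed Perl; purely as an illustration of the two-stage check, here is a Python sketch (the distance threshold of 2 is an arbitrary assumption):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def same_title(a, b, meta_a, meta_b, max_dist=2):
    """Cheap phonetic exact match first, edit distance as the fallback."""
    if meta_a["soundex"] == meta_b["soundex"]:
        return True
    return levenshtein(a, b) <= max_dist
```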

Pass 3 (New Titles)

Bookkeeping

You should keep track of information as you go that will help you refine your process further. For instance, every time you make a match in pass2/phase3 - it means that there was a missed opportunity in pass1/phase1. Perhaps you have a stopword that shouldn't be one, or you missed a common typo, etc. Additionally, if a title has been deemed truly unique, there is no reason to ever again compare it against the titles that existed at that time. The only notable exception would be if you went and re-populated the meta data as a result of refinements. The point of this bookkeeping is to only pay the hefty price once up front.
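The bookkeeping itself can be as cheap as a set and a log (structure and names here are illustrative):

```python
confirmed_unique = set()   # titles proven unique; never recompare these
late_matches = []          # matches found only in pass2/phase3

def record_late_match(a, b):
    """Each entry is a missed pass1/phase1 opportunity to investigate:
    a stopword that shouldn't be one, a typo the fix table lacks, etc."""
    late_matches.append((a, b))

def mark_unique(title):
    """Title survived all comparisons; skip it in future runs."""
    confirmed_unique.add(title)
```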

Phase 1: Wash-Rinse-Repeat

All phases of passes 1 and 2 will need to be done for new titles added. These should go much faster now for a number of reasons.

I am soliciting comments and feedback in the CB so there may be significant updates. I will annotate any as appropriate.

Cheers - L~R