Find similar records based on multiple column with multiple criteria

ssc37 has asked for the wisdom of the Perl Monks concerning the following question:

Hi ,

This is not really a perl question, but I 'm scratching my head for days to find a way to process my data with perl and the best way to achieve this.
I do not have much experience with perl but have already developed some processing scripts .
I have Mysql tables with a few million rows to process . In these tables I am trying to identify the records that are the same for me based on several criteria .
Each column in my extraction may be a criterion or not, significant or not, depending on other columns.

A piece of my extraction :

+----+-----+-------+--------+----+----+--------+----+--------+---+
| M  | m   | p     | k      | y  | my | r      | s  | o      | c |
+----+-----+-------+--------+----+----+--------+----+--------+---+
| 84 | 250 | 16700 |   4900 | 13 |  0 | 102124 | 23 |      0 | 0 |* si
+milar
| 84 | 250 | 17290 |   4905 | 13 |  6 | 102124 |  1 |   3687 | 0 |* si
+milar
| 84 | 250 | 17290 |   4905 | 13 |  6 | 102124 | 22 |   3687 | 2 |* si
+milar
| 84 | 250 | 17290 |   4910 | 13 |  6 | 102124 |  3 |   3687 | 2 |* si
+milar
| 84 | 250 | 16700 |   4900 | 13 |  6 | 102124 |  3 |      0 | 5 |* si
+milar
| 84 | 250 | 17290 |   4905 | 13 |  6 | 102124 |  4 |   3687 | 2 |* si
+milar
| 84 | 250 | 10200 |  46423 | 11 |  5 |  52012 | 23 |    485 | 1 |# si
+milar
| 84 | 250 | 10900 |  46423 | 11 |  5 |  52012 |  8 |    485 | 0 |# si
+milar
| 84 | 250 |  9900 |  46423 | 11 |  5 |  52012 | 22 |    485 | 1 |# si
+milar
| 84 | 250 | 10900 |  46423 | 11 |  5 |  52012 |  3 |    485 | 1 |# si
+milar
| 84 | 250 |  5200 | 150000 | 07 | 11 |  31609 |  8 |  54964 | 3 |& si
+milar
| 84 | 250 |  5490 | 150000 | 07 |  0 |      0 | 23 |  54964 | 0 |& si
+milar
| 84 | 250 |  5300 | 150000 | 07 | 11 |  31609 |  6 |  54964 | 0 |& si
+milar
| 84 | 250 | 14390 |  49501 | 11 |  5 |      0 | 22 | 140427 | 1 |§ si
+milar
| 84 | 250 | 13980 |  49501 | 11 |  5 |  31751 |  6 | 140427 | 0 |§ si
+milar
| 84 | 250 | 13980 |  49501 | 11 |  5 |  31751 |  3 | 140427 | 1 |§ si
+milar
| 84 | 250 | 14380 |  49501 | 11 |  5 |      0 | 23 | 140427 | 1 |§ si
+milar
| 84 | 250 | 14380 |  49501 | 11 |  5 |      0 |  1 | 140427 | 0 |§ si
+milar
+----+-----+-------+--------+----+----+--------+----+--------+---+
[download]

What approach would you have with this kind of problem ?
what would be the most practical ? optimized ?

Personally I thought: ( I 'm lost ... )

- Extract all my table ( in an Array or Hash or AoH or HoA ?)
- Referencing each column in a different array ( useful? construction with a foreach ? )
- For each record , for each column, record ids with similarity and cross at the end the results of each columns according to my criteria to find similar records
Maybe you had examples of projects with similar apparoche ? or maybe you are there packages facilitate this treatment?
Best regards ,

Comment on Find similar records based on multiple column with multiple criteria Download Code

Replies are listed 'Best First'.
Re: Find similar records based on multiple column with multiple criteria by Kenosis (Priest) on Dec 13, 2013 at 19:44 UTC
Can't you query MySQL "...to identify the records that are the same for me you based on several criteria..."? Edit: If you're more inclined to use Perl than MySQL queries, perhaps the following will provide a start: Read more... (5 kB)	[reply] [d/l] [select]
Re^2: Find similar records based on multiple column with multiple criteria by VincentK (Beadle) on Dec 13, 2013 at 20:26 UTC
Hello ssc37. I agree with Kenosis. It sounds to me like you should be able to run a query or multiple queries against your table to pull out the results that you want. Some table joins may be in order to accomplish this. Maybe these references will help. http://www.w3resource.com/sql/joins/join-a-table-to-itself.php http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html	[reply]
Re: Find similar records based on multiple column with multiple criteria by erix (Prior) on Dec 13, 2013 at 21:27 UTC
It's not clear what you want done: you do not define similarity. FWIW, the provided data can be grouped using only column k, by a simple `floor( k / 100.0 ) * 100` expression. How can more be deduced from your data? select mm, m , p , k , y , my, r , s , o , c, whatness from t order by floor( k / 100.0 ) * 100 , k ; mm \| m \| p \| k \| y \| my \| r \| s \| o \| c \| what +ness ----+-----+-------+--------+----+----+--------+----+--------+---+----- +------ 84 \| 250 \| 16700 \| 4900 \| 13 \| 0 \| 102124 \| 23 \| 0 \| 0 \| * si +milar 84 \| 250 \| 16700 \| 4900 \| 13 \| 6 \| 102124 \| 3 \| 0 \| 5 \| * si +milar 84 \| 250 \| 17290 \| 4905 \| 13 \| 6 \| 102124 \| 1 \| 3687 \| 0 \| * si +milar 84 \| 250 \| 17290 \| 4905 \| 13 \| 6 \| 102124 \| 22 \| 3687 \| 2 \| * si +milar 84 \| 250 \| 17290 \| 4905 \| 13 \| 6 \| 102124 \| 4 \| 3687 \| 2 \| * si +milar 84 \| 250 \| 17290 \| 4910 \| 13 \| 6 \| 102124 \| 3 \| 3687 \| 2 \| * si +milar 84 \| 250 \| 10900 \| 46423 \| 11 \| 5 \| 52012 \| 3 \| 485 \| 1 \| # si +milar 84 \| 250 \| 10200 \| 46423 \| 11 \| 5 \| 52012 \| 23 \| 485 \| 1 \| # si +milar 84 \| 250 \| 10900 \| 46423 \| 11 \| 5 \| 52012 \| 8 \| 485 \| 0 \| # si +milar 84 \| 250 \| 9900 \| 46423 \| 11 \| 5 \| 52012 \| 22 \| 485 \| 1 \| # si +milar 84 \| 250 \| 14380 \| 49501 \| 11 \| 5 \| 0 \| 1 \| 140427 \| 0 \| § si +milar 84 \| 250 \| 13980 \| 49501 \| 11 \| 5 \| 31751 \| 6 \| 140427 \| 0 \| § si +milar 84 \| 250 \| 13980 \| 49501 \| 11 \| 5 \| 31751 \| 3 \| 140427 \| 1 \| § si +milar 84 \| 250 \| 14380 \| 49501 \| 11 \| 5 \| 0 \| 23 \| 140427 \| 1 \| § si +milar 84 \| 250 \| 14390 \| 49501 \| 11 \| 5 \| 0 \| 22 \| 140427 \| 1 \| § si +milar 84 \| 250 \| 5490 \| 150000 \| 7 \| 0 \| 0 \| 23 \| 54964 \| 0 \| & si +milar 84 \| 250 \| 5300 \| 150000 \| 7 \| 11 \| 31609 \| 6 \| 54964 \| 0 \| & si +milar 84 \| 250 \| 5200 \| 150000 \| 7 \| 11 \| 31609 \| 8 \| 54964 \| 3 \| & si +milar (18 rows) [download]	[reply] [d/l] [select]
Re^2: Find similar records based on multiple column with multiple criteria by ssc37 (Acolyte) on Dec 15, 2013 at 02:07 UTC
This is not as simple as querying Mysql ... Records that are similar are determined by many factors. So I need to pass the information to validate or not the fact that several lines should be considered similar. I do not want to overload the server with thousands of queries, or create tables with the memory storage engine. When I thought, in fact, what I need is to be able to recover the data I have to deal with the MySQL server, keep them in memory on the server that executed the script, and access such a simple way as Mysql. A kind of NoSQL inside perl actually, to manipulate my data on the fly, create temporary tables etc. .. The only thing I found is that looks like this: http://search.cpan.org/ ~ vladb/DBIx-DataLookup-0.03/DataLookup/DataLookup.pm but this no longer seems maintained Any suggestions?	[reply]
Re^3: Find similar records based on multiple column with multiple criteria by erix (Prior) on Dec 15, 2013 at 05:41 UTC
It still doesn't sound very clear... Guesswork on my part, but perhaps Algorithm::Cluster, based on Michiel de Hoon's cluster program is useful? A commandline --help page is here; and the paper. If all else fails there is a clustering bibliography as well.	[reply]
Re: Find similar records based on multiple column with multiple criteria by VincentK (Beadle) on Dec 16, 2013 at 14:27 UTC
Hi ssc37. Do you have all of your match conditions defined? If so, what are they? It would be helpful to know what you are using to consider records similar. I loaded your data into a temporary table and queried three columns as follows. `select mm,m,k , COUNT() as RecCount from [dbo].[testing_data] group by mm,m,k having COUNT() > 1` [download] I know this isn't the Perl solution you are after, but does this help?	[reply] [d/l]