ssc37 has asked for the wisdom of the Perl Monks concerning the following question:

Hi ,

This is not really a perl question, but I 'm scratching my head for days to find a way to process my data with perl and the best way to achieve this.
I do not have much experience with perl but have already developed some processing scripts .
I have Mysql tables with a few million rows to process . In these tables I am trying to identify the records that are the same for me based on several criteria .
Each column in my extraction may be a criterion or not, significant or not, depending on other columns.

A piece of my extraction :

+----+-----+-------+--------+----+----+--------+----+--------+---+ | M | m | p | k | y | my | r | s | o | c | +----+-----+-------+--------+----+----+--------+----+--------+---+ | 84 | 250 | 16700 | 4900 | 13 | 0 | 102124 | 23 | 0 | 0 |* si +milar | 84 | 250 | 17290 | 4905 | 13 | 6 | 102124 | 1 | 3687 | 0 |* si +milar | 84 | 250 | 17290 | 4905 | 13 | 6 | 102124 | 22 | 3687 | 2 |* si +milar | 84 | 250 | 17290 | 4910 | 13 | 6 | 102124 | 3 | 3687 | 2 |* si +milar | 84 | 250 | 16700 | 4900 | 13 | 6 | 102124 | 3 | 0 | 5 |* si +milar | 84 | 250 | 17290 | 4905 | 13 | 6 | 102124 | 4 | 3687 | 2 |* si +milar | 84 | 250 | 10200 | 46423 | 11 | 5 | 52012 | 23 | 485 | 1 |# si +milar | 84 | 250 | 10900 | 46423 | 11 | 5 | 52012 | 8 | 485 | 0 |# si +milar | 84 | 250 | 9900 | 46423 | 11 | 5 | 52012 | 22 | 485 | 1 |# si +milar | 84 | 250 | 10900 | 46423 | 11 | 5 | 52012 | 3 | 485 | 1 |# si +milar | 84 | 250 | 5200 | 150000 | 07 | 11 | 31609 | 8 | 54964 | 3 |& si +milar | 84 | 250 | 5490 | 150000 | 07 | 0 | 0 | 23 | 54964 | 0 |& si +milar | 84 | 250 | 5300 | 150000 | 07 | 11 | 31609 | 6 | 54964 | 0 |& si +milar | 84 | 250 | 14390 | 49501 | 11 | 5 | 0 | 22 | 140427 | 1 |§ si +milar | 84 | 250 | 13980 | 49501 | 11 | 5 | 31751 | 6 | 140427 | 0 |§ si +milar | 84 | 250 | 13980 | 49501 | 11 | 5 | 31751 | 3 | 140427 | 1 |§ si +milar | 84 | 250 | 14380 | 49501 | 11 | 5 | 0 | 23 | 140427 | 1 |§ si +milar | 84 | 250 | 14380 | 49501 | 11 | 5 | 0 | 1 | 140427 | 0 |§ si +milar +----+-----+-------+--------+----+----+--------+----+--------+---+
What approach would you have with this kind of problem ?
what would be the most practical ? optimized ?

Personally I thought: ( I 'm lost ... )

- Extract all my table ( in an Array or Hash or AoH or HoA ?)
- Referencing each column in a different array ( useful? construction with a foreach ? )
- For each record , for each column, record ids with similarity and cross at the end the results of each columns according to my criteria to find similar records
Maybe you had examples of projects with similar apparoche ? or maybe you are there packages facilitate this treatment?
Best regards ,
  • Comment on Find similar records based on multiple column with multiple criteria
  • Download Code

Replies are listed 'Best First'.
Re: Find similar records based on multiple column with multiple criteria
by Kenosis (Priest) on Dec 13, 2013 at 19:44 UTC

    Can't you query MySQL "...to identify the records that are the same for me you based on several criteria..."?

    Edit: If you're more inclined to use Perl than MySQL queries, perhaps the following will provide a start:

Re: Find similar records based on multiple column with multiple criteria
by erix (Prior) on Dec 13, 2013 at 21:27 UTC

    It's not clear what you want done: you do not define similarity.

    FWIW, the provided data can be grouped using only column k, by a simple floor( k / 100.0 ) * 100 expression. How can more be deduced from your data?

    select mm, m , p , k , y , my, r , s , o , c, whatness from t order by floor( k / 100.0 ) * 100 , k ; mm | m | p | k | y | my | r | s | o | c | what +ness ----+-----+-------+--------+----+----+--------+----+--------+---+----- +------ 84 | 250 | 16700 | 4900 | 13 | 0 | 102124 | 23 | 0 | 0 | * si +milar 84 | 250 | 16700 | 4900 | 13 | 6 | 102124 | 3 | 0 | 5 | * si +milar 84 | 250 | 17290 | 4905 | 13 | 6 | 102124 | 1 | 3687 | 0 | * si +milar 84 | 250 | 17290 | 4905 | 13 | 6 | 102124 | 22 | 3687 | 2 | * si +milar 84 | 250 | 17290 | 4905 | 13 | 6 | 102124 | 4 | 3687 | 2 | * si +milar 84 | 250 | 17290 | 4910 | 13 | 6 | 102124 | 3 | 3687 | 2 | * si +milar 84 | 250 | 10900 | 46423 | 11 | 5 | 52012 | 3 | 485 | 1 | # si +milar 84 | 250 | 10200 | 46423 | 11 | 5 | 52012 | 23 | 485 | 1 | # si +milar 84 | 250 | 10900 | 46423 | 11 | 5 | 52012 | 8 | 485 | 0 | # si +milar 84 | 250 | 9900 | 46423 | 11 | 5 | 52012 | 22 | 485 | 1 | # si +milar 84 | 250 | 14380 | 49501 | 11 | 5 | 0 | 1 | 140427 | 0 | § si +milar 84 | 250 | 13980 | 49501 | 11 | 5 | 31751 | 6 | 140427 | 0 | § si +milar 84 | 250 | 13980 | 49501 | 11 | 5 | 31751 | 3 | 140427 | 1 | § si +milar 84 | 250 | 14380 | 49501 | 11 | 5 | 0 | 23 | 140427 | 1 | § si +milar 84 | 250 | 14390 | 49501 | 11 | 5 | 0 | 22 | 140427 | 1 | § si +milar 84 | 250 | 5490 | 150000 | 7 | 0 | 0 | 23 | 54964 | 0 | & si +milar 84 | 250 | 5300 | 150000 | 7 | 11 | 31609 | 6 | 54964 | 0 | & si +milar 84 | 250 | 5200 | 150000 | 7 | 11 | 31609 | 8 | 54964 | 3 | & si +milar (18 rows)
      This is not as simple as querying Mysql ...
      Records that are similar are determined by many factors.
      So I need to pass the information to validate or not the fact that several lines should be considered similar.
      I do not want to overload the server with thousands of queries, or create tables with the memory storage engine.
      When I thought, in fact, what I need is to be able to recover the data I have to deal with the MySQL server, keep them in memory on the server that executed the script, and access such a simple way as Mysql. A kind of NoSQL inside perl actually, to manipulate my data on the fly, create temporary tables etc. ..
      The only thing I found is that looks like this:
      http://search.cpan.org/ ~ vladb/DBIx-DataLookup-0.03/DataLookup/DataLookup.pm but this no longer seems maintained
      Any suggestions?
Re: Find similar records based on multiple column with multiple criteria
by VincentK (Beadle) on Dec 16, 2013 at 14:27 UTC
    Hi ssc37.

    Do you have all of your match conditions defined? If so, what are they? It would be helpful to know what you are using to consider records similar.

    I loaded your data into a temporary table and queried three columns as follows.

    select mm,m,k , COUNT(*) as RecCount from [dbo].[testing_data] group by mm,m,k having COUNT(*) > 1

    I know this isn't the Perl solution you are after, but does this help?