Hi,

This is not really a Perl question, but I have been scratching my head for days trying to find the best way to process my data with Perl.
I do not have much experience with Perl, but I have already written a few processing scripts.
I have MySQL tables with a few million rows to process. In these tables I am trying to identify the records that I consider to be the same, based on several criteria.
Each column in my extract may or may not be a criterion, and may or may not be significant, depending on the other columns.

A sample of my extract (rows carrying the same marker, *, #, & or §, are the ones I consider similar):

+----+-----+-------+--------+----+----+--------+----+--------+---+
| M  | m   | p     | k      | y  | my | r      | s  | o      | c |
+----+-----+-------+--------+----+----+--------+----+--------+---+
| 84 | 250 | 16700 |   4900 | 13 |  0 | 102124 | 23 |      0 | 0 |* similar
| 84 | 250 | 17290 |   4905 | 13 |  6 | 102124 |  1 |   3687 | 0 |* similar
| 84 | 250 | 17290 |   4905 | 13 |  6 | 102124 | 22 |   3687 | 2 |* similar
| 84 | 250 | 17290 |   4910 | 13 |  6 | 102124 |  3 |   3687 | 2 |* similar
| 84 | 250 | 16700 |   4900 | 13 |  6 | 102124 |  3 |      0 | 5 |* similar
| 84 | 250 | 17290 |   4905 | 13 |  6 | 102124 |  4 |   3687 | 2 |* similar
| 84 | 250 | 10200 |  46423 | 11 |  5 |  52012 | 23 |    485 | 1 |# similar
| 84 | 250 | 10900 |  46423 | 11 |  5 |  52012 |  8 |    485 | 0 |# similar
| 84 | 250 |  9900 |  46423 | 11 |  5 |  52012 | 22 |    485 | 1 |# similar
| 84 | 250 | 10900 |  46423 | 11 |  5 |  52012 |  3 |    485 | 1 |# similar
| 84 | 250 |  5200 | 150000 | 07 | 11 |  31609 |  8 |  54964 | 3 |& similar
| 84 | 250 |  5490 | 150000 | 07 |  0 |      0 | 23 |  54964 | 0 |& similar
| 84 | 250 |  5300 | 150000 | 07 | 11 |  31609 |  6 |  54964 | 0 |& similar
| 84 | 250 | 14390 |  49501 | 11 |  5 |      0 | 22 | 140427 | 1 |§ similar
| 84 | 250 | 13980 |  49501 | 11 |  5 |  31751 |  6 | 140427 | 0 |§ similar
| 84 | 250 | 13980 |  49501 | 11 |  5 |  31751 |  3 | 140427 | 1 |§ similar
| 84 | 250 | 14380 |  49501 | 11 |  5 |      0 | 23 | 140427 | 1 |§ similar
| 84 | 250 | 14380 |  49501 | 11 |  5 |      0 |  1 | 140427 | 0 |§ similar
+----+-----+-------+--------+----+----+--------+----+--------+---+

How would you approach this kind of problem?
What would be the most practical way? The most efficient?

Personally, this is what I had in mind (but I'm lost...):

- Extract the whole table (into an array, a hash, an AoH or a HoA?)
- Index each column in a separate array (is that useful? built with a foreach?)
- For each record and each column, record the ids of the records that are similar, then at the end cross the per-column results according to my criteria to find the similar records
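
To make the steps above concrete, here is a minimal sketch of that idea in Perl. It is only an illustration under assumptions that are not my real criteria: the rule "M, m and y must match exactly and k may differ by at most 10" is hypothetical, and so are the `id` field (the extract above has no id column) and the names `group_similar` and `$K_TOLERANCE`.

```perl
use strict;
use warnings;

# Hypothetical rule, for illustration only: two records are "similar" when
# M, m and y match exactly and neighbouring k values differ by at most this.
my $K_TOLERANCE = 10;

# A few rows modelled on the extract, as an array of hashes (AoH).
# The id field is an assumption; only the columns used by the rule are kept.
my @rows = (
    { id => 1, M => 84, m => 250, k => 4900,   y => '13' },
    { id => 2, M => 84, m => 250, k => 4905,   y => '13' },
    { id => 3, M => 84, m => 250, k => 4910,   y => '13' },
    { id => 4, M => 84, m => 250, k => 46423,  y => '11' },
    { id => 5, M => 84, m => 250, k => 46423,  y => '11' },
    { id => 6, M => 84, m => 250, k => 49501,  y => '11' },
    { id => 7, M => 84, m => 250, k => 150000, y => '07' },
);

sub group_similar {
    my ($rows, $tolerance) = @_;

    # Step 1: bucket rows on the columns that must match exactly.
    my %bucket;
    for my $row (@$rows) {
        my $key = join '|', @{$row}{qw(M m y)};
        push @{ $bucket{$key} }, $row;
    }

    # Step 2: inside each bucket, sort by k (then id) and start a new
    # group whenever the gap to the previous k exceeds the tolerance.
    my @groups;
    for my $key (sort keys %bucket) {
        my @sorted = sort { $a->{k} <=> $b->{k} or $a->{id} <=> $b->{id} }
                     @{ $bucket{$key} };
        my @current = (shift @sorted);
        for my $row (@sorted) {
            if ($row->{k} - $current[-1]{k} > $tolerance) {
                push @groups, [ map { $_->{id} } @current ];
                @current = ();
            }
            push @current, $row;
        }
        push @groups, [ map { $_->{id} } @current ];
    }
    return \@groups;    # array ref of array refs of ids
}

my $groups = group_similar(\@rows, $K_TOLERANCE);
print join(',', @$_), "\n" for @$groups;
```

With a few million rows, holding everything in one AoH may be too heavy. Under the same assumed rule, the rows could instead be fetched already ordered by the exact-match columns and then by k, so that only the current group has to stay in memory.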
Maybe you have examples of projects that take a similar approach? Or maybe there are packages that make this kind of processing easier?
Best regards,

In reply to Find similar records based on multiple column with multiple criteria by ssc37
