For example, we have 5 comma separated data file with 20 fields and we have been asked to show all variants contained in each of 3 of the fields over the 5 files in the least number of records.
Variants of what? Field values? Field value combinations? Variants relative to a canonical set of values? This isn't very clear to me. Do you mean:
- The smallest sample of records such that each and every value from field X appears in at least one of the sample records? i.e. combinations of values don't matter. We only require that the full domain of X, the full domain of Y, and the full domain of Z be represented.
- The smallest sample of records such that each and every canonical value from field X appears in at least one of the sample records? (this would be a smaller set than the previous sample)
- The smallest sample of records such that all distinct combinations of values from fields X,Y,Z are included?
Also, do you really want the smallest possible sampling of records or just to get rid of records for which all values have already been seen? Consider the sample, [A1,B1],[A1,B2],[A2,B1],[A2,B2]. If you only want to get rid of all records where all values have been seen then, assuming you don't care about combination, <code>[A1,B1],[A1,B2],[A2,B1] (a la CountZero's solution) is ok. If you really want the minimum number of records, then the right solution would be [A1,B1],[A2,B2] as pointed out by Kyle, and we've got a much more difficult optimization problem.
Best, beth
Update: - removed code sample due to failure to solve problem raised by Kyle and added a question about what the OP really means by "least".
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
| |
For: |
|
Use: |
| & | | & |
| < | | < |
| > | | > |
| [ | | [ |
| ] | | ] |
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.