I am looking to do some pattern recognition in datasets. I have ~138,000 marker IDs, each scored as either 1 (yes) or 0 (no). I have a collection of items that fall into their own sub-groups. I need to find out which markers are characteristic of each sub-group, i.e. shared within that sub-group but absent from the other sub-groups. In the example below, Items 1 and 2 would be one sub-group, Items 3 and 4 another, and Items 5 and 6 a third. (This dataset is a poor representation because of its small size, and because these items are all actually from one sub-group in my collection, so they should be fairly similar.)

My goal is a script that I can run on updated collections with definable sub-groups, and that outputs a list of a user-definable number of marker IDs that are common within each sub-group but not found in the other sub-groups. For example, I want 10k IDs for each sub-group that are relevant to that sub-group but not to the others. I wasn't sure whether this is something Perl would be good for and, if it is, where I would even start.
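To make the goal above concrete, here is a minimal sketch of one possible scoring rule. It is written in Python purely for illustration, and the same logic maps directly onto Perl hashes and arrays. Everything named here is an assumption rather than an established method: the function `distinctive_markers`, the sub-group labels `A`, `B`, `C`, and the score itself (fraction of items inside the sub-group that have the marker, minus the fraction of items outside it that have it). The handful of rows are copied from the example data below.

```python
from typing import Dict, List


def distinctive_markers(
    rows: Dict[int, List[int]],
    groups: Dict[str, List[int]],
    top_n: int,
) -> Dict[str, List[int]]:
    """Return up to top_n marker IDs per sub-group, best score first.

    Assumed score: in-group frequency minus out-of-group frequency.
    Only markers with a positive score are kept.
    """
    n_items = len(next(iter(rows.values())))
    result: Dict[str, List[int]] = {}
    for name, cols in groups.items():
        inside = set(cols)
        others = [c for c in range(n_items) if c not in inside]
        scored = []
        for marker_id, calls in rows.items():
            in_frac = sum(calls[c] for c in cols) / len(cols)
            out_frac = sum(calls[c] for c in others) / len(others) if others else 0.0
            score = in_frac - out_frac
            if score > 0:
                scored.append((score, marker_id))
        # Highest score first; ties broken by the lower marker ID.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[name] = [marker_id for _, marker_id in scored[:top_n]]
    return result


# Marker ID -> calls for Item1..Item6 (rows 9, 12, 27, 38, 39 of the example).
rows = {
    9: [1, 1, 1, 1, 1, 1],
    12: [0, 1, 0, 1, 0, 0],
    27: [0, 0, 0, 0, 1, 1],
    38: [0, 0, 1, 0, 0, 0],
    39: [1, 0, 0, 0, 0, 0],
}
# Hypothetical sub-group labels mapped to 0-based item columns.
groups = {"A": [0, 1], "B": [2, 3], "C": [4, 5]}

print(distinctive_markers(rows, groups, top_n=2))
```

With a rule like this, a marker present in every item (such as ID 9) scores zero and is dropped, while a marker present only in Items 5 and 6 (such as ID 27) gets the maximum score for the third sub-group. For 138k markers the loop is a single pass per sub-group, so the size of the data should not be an obstacle in either language.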

Example data:

ID   Item1  Item2  Item3  Item4  Item5  Item6
1    0      0      0      0      0      0
2    1      1      0      1      1      0
3    1      0      1      1      1      1
4    0      0      0      0      0      0
5    1      1      1      1      1      0
6    0      0      0      0      0      0
7    0      0      0      0      0      0
8    0      0      0      0      0      1
9    1      1      1      1      1      1
10   0      0      0      0      0      0
11   1      1      1      1      1      1
12   0      1      0      1      0      0
13   0      0      0      0      0      0
14   1      1      1      1      1      1
15   1      1      1      1      1      1
16   1      1      1      1      1      1
17   1      1      1      1      1      1
18   1      1      1      1      1      1
19   0      0      0      1      1      1
20   1      1      1      1      1      1
21   0      0      0      0      0      0
22   0      0      0      0      0      0
23   1      1      1      1      1      1
24   1      1      1      1      1      1
25   1      1      0      1      1      1
26   1      1      1      1      1      1
27   0      0      0      0      1      1
28   0      1      0      1      1      1
29   0      0      0      0      0      0
30   0      0      0      0      0      0
31   1      1      0      1      0      0
32   0      0      0      0      0      0
33   1      0      1      0      1      1
34   0      0      0      1      0      1
35   1      0      0      1      1      1
36   0      0      0      0      0      0
37   0      0      0      0      0      0
38   0      0      1      0      0      0
39   1      0      0      0      0      0
40   0      0      0      0      0      0
41   1      1      0      1      1      1
42   0      0      0      0      0      0
43   0      0      0      0      0      0
44   0      0      0      0      0      1
45   1      0      0      0      0      0
46   1      0      0      1      0      0
47   1      1      1      1      1      1
48   0      0      0      0      0      0
49   1      0      1      1      0      1
50   1      1      1      1      1      1
51   0      0      0      0      0      0
52   1      0      0      0      0      1
53   0      0      0      0      0      0
54   1      0      1      0      0      0
55   0      0      0      0      0      0
56   1      0      0      1      1      1
57   0      0      0      0      0      0
58   0      0      0      0      0      0
59   0      0      0      0      0      0
60   0      0      0      0      0      0
61   0      0      0      0      0      0
62   0      0      0      0      0      0
63   1      0      0      0      1      0
64   1      1      0      0      1      1
65   1      0      0      0      0      0
66   1      1      1      1      1      1
67   1      1      1      1      1      1
68   0      0      0      1      1      1
69   1      0      1      1      1      0
70   0      0      0      0      0      0
71   0      0      0      0      0      0
72   1      0      1      0      1      0
73   0      0      0      0      1      1
74   0      0      0      1      1      0
75   1      1      1      1      1      1
76   1      1      1      1      1      1
77   1      0      0      0      0      0
78   1      1      1      1      1      1
79   0      0      0      0      0      0
80   0      0      0      0      0      0
81   0      1      1      0      1      1
82   1      1      1      1      1      1
83   1      1      1      1      1      0
84   0      0      0      0      0      0
85   1      1      1      1      1      1
86   0      0      0      0      0      0
87   1      1      1      1      1      1
88   0      0      0      1      1      1
89   0      0      0      0      0      0
90   0      0      0      0      0      0
91   0      0      0      0      0      0
92   1      1      1      1      1      1
93   0      0      0      0      0      0
94   1      1      1      1      1      1
95   0      0      0      0      0      0
96   1      1      1      1      1      1
97   0      0      0      0      0      0
98   0      0      0      0      0      0
99   1      0      0      1      1      1
100  0      0      0      0      0      0
101  1      1      1      1      1      1
102  1      0      0      0      0      0
103  0      0      0      0      0      0
104  0      0      0      0      0      1
105  0      0      0      0      0      0
106  0      0      0      0      1      1
107  1      1      1      1      1      1
108  1      1      1      1      1      1
109  0      0      0      0      0      0
110  0      0      0      0      0      0
111  0      0      0      0      0      0
112  0      0      0      0      0      0
113  1      1      1      1      1      1
114  1      1      1      1      1      1
115  0      0      0      0      0      0
116  0      0      0      0      0      0
117  0      0      0      0      0      0
118  0      0      0      0      0      0
119  0      0      0      0      0      0
120  0      0      0      0      0      0
121  0      0      0      0      0      0
122  1      1      0      1      1      1
123  1      1      1      1      1      1
124  1      1      0      1      1      1
125  0      0      0      0      0      0
126  0      1      0      0      1      1
127  0      0      0      0      0      1
128  1      1      1      1      1      1
129  1      0      0      1      1      1
130  1      0      0      0      0      0
131  0      0      0      0      0      0
132  1      0      0      1      1      0
133  1      1      1      1      1      1
134  1      1      1      1      1      1
135  0      0      0      0      1      1
136  0      0      0      0      0      0
137  0      0      0      0      0      0
138  0      0      0      0      0      0
139  0      0      0      0      0      0
140  1      1      1      1      1      1
141  0      0      0      0      0      0
142  0      0      0      0      0      0
143  1      1      1      1      1      1
144  1      1      1      1      1      1
145  0      0      0      0      0      0
146  0      0      0      0      0      0
147  1      1      1      1      1      1
148  1      1      1      1      1      1
149  0      1      1      1      1      1
150  0      0      0      0      0      0
151  0      0      0      0      0      0
152  1      1      1      1      1      1
153  0      0      0      0      0      0
154  0      0      0      0      0      0
155  1      1      1      1      1      1
156  1      1      1      1      1      1
157  0      0      0      0      0      0
158  1      1      1      1      1      1
159  0      0      0      0      0      0
160  0      0      0      0      0      0
161  1      1      1      1      1      1
162  0      0      0      0      0      0
163  0      0      0      0      0      0
164  0      0      0      0      0      0
165  0      0      0      0      0      0
166  0      0      0      0      0      0
167  0      0      0      0      0      0
168  1      1      0      1      0      0
169  1      1      1      1      1      1
170  1      1      1      1      1      1
171  0      1      0      1      0      0
172  0      0      0      0      0      0
173  1      1      1      1      1      0
174  0      0      0      0      0      0
175  0      0      0      0      0      0
176  0      0      0      0      0      0
177  0      0      0      0      0      0
178  0      0      0      0      0      0
179  0      0      0      0      0      0
180  1      1      1      1      1      1
181  0      0      0      0      0      0
182  1      1      1      1      1      1
183  0      0      0      0      0      0
184  1      1      1      1      1      1
185  1      1      1      1      1      1
186  0      0      0      1      0      0
187  1      1      1      1      1      1
188  1      1      1      1      1      1
189  0      0      0      0      0      0
190  0      0      0      0      0      0
191  0      0      0      0      0      0
192  1      1      0      1      1      1
193  1      1      1      1      0      0
194  0      0      0      0      0      0
195  0      0      0      0      0      0
196  1      1      1      1      1      1
197  1      1      1      1      1      1
198  1      0      0      0      0      0
199  1      1      1      1      1      1
200  0      0      0      0      0      0
... (continues to ~138k IDs)

In reply to Would Perl be a good choice for this? by Speed_Freak
