Re: Would Perl be a good choice for this?

> ..where I would even start.

Hello Speed_Freak, your question confuses me: too much data, no code at all on your part, no expected results, and I do not really understand these subgroups or the goal.

But since you are asking where to start: "know your data" is a good suggestion, and another good quote goes something like: when you know your data deeply, the algorithm is simply a matter of implementation.

So where to start? Ordering => array, and indexing => hash.

I mean that when you are processing your data, you split it up into elements and fill a data structure that suits your needs. So the basis is a simple loop that consumes lines of data:


use strict;
use warnings;

while (<DATA>){
  chomp;
  my @ele = split ' ', $_;   # split on runs of whitespace
  # ... fill your data structure from @ele here
}

Now that you have @ele, you need to coerce it to your logic. Supposing you need to store which ID ( $ele[0] ) has $ele[1] + $ele[2], you can index on the presence of that pair by using "$ele[1] $ele[2]" as the key of a hash and pushing the IDs as values onto an anonymous array:

use strict;
use warnings;

my %res;
while (<DATA>){
  chomp;
  my @ele = split ' ', $_;   # split on runs of whitespace

  push @{ $res{"$ele[1] $ele[2]"} }, $ele[0];
}
__DATA__
1 monkey cow hammer nail
2 monkey sheep hammer nail
3 dog cat hammer nail
4 monkey cow hammer nail

This leads you to a data structure like: ( "dog cat" => [3], "monkey sheep" => [2], "monkey cow" => [1, 4] )
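To check what you actually built, you can dump the structure with Data::Dumper, which ships with Perl. This is only a minimal sketch: build_index is a helper name I made up for illustration, and the lines are passed in as a list instead of being read from DATA.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper (the name is invented for this sketch):
# build the "marker pair => [IDs]" index from a list of lines.
sub build_index {
    my @lines = @_;
    my %res;
    for my $line (@lines) {
        my @ele = split ' ', $line;
        push @{ $res{"$ele[1] $ele[2]"} }, $ele[0];
    }
    return \%res;
}

my $res = build_index(
    '1 monkey cow hammer nail',
    '2 monkey sheep hammer nail',
    '3 dog cat hammer nail',
    '4 monkey cow hammer nail',
);

$Data::Dumper::Sortkeys = 1;   # stable key order; hash order is not guaranteed
$Data::Dumper::Indent   = 1;   # more compact indentation
print Dumper($res);
```

Sorting the keys only matters for reading the dump; the hash itself has no defined order, which is why the ordering of the keys shown above is arbitrary.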

If you just need to know which IDs have monkey, you loop over the keys of the hash, searching for the pattern monkey, as in:

foreach my $key (keys %res){
  if ($key =~ /monkey/) {
     print "monkey [occurrence in $key] found in IDs: ",
           join(', ', @{ $res{$key} }), "\n";
  }
}
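If looking up a single marker turns out to be the common case, another option is to index each marker on its own, so the lookup becomes a direct hash access instead of a regex scan over every key (and /monkey/ would also match a key containing, say, monkeys). A rough sketch, assuming the markers of interest are the second and third fields as in the sample data; index_by_marker is an invented name:

```perl
use strict;
use warnings;

# Hypothetical helper (the name is invented for this sketch):
# map each marker found in fields 1 and 2 to the list of IDs carrying it.
sub index_by_marker {
    my @lines = @_;
    my %ids_for;
    for my $line (@lines) {
        my ($id, @markers) = (split ' ', $line)[0 .. 2];
        push @{ $ids_for{$_} }, $id for @markers;
    }
    return \%ids_for;
}

my $ids_for = index_by_marker(
    '1 monkey cow hammer nail',
    '2 monkey sheep hammer nail',
    '3 dog cat hammer nail',
    '4 monkey cow hammer nail',
);

# exists checks for a marker without creating an entry for it
for my $marker (qw(monkey dog zebra)) {
    if (exists $ids_for->{$marker}) {
        print "$marker found in IDs: ",
              join(', ', @{ $ids_for->{$marker} }), "\n";
    }
    else {
        print "$marker not found\n";
    }
}
```

The trade-off is that you lose the pairing between the two markers, so which index to build depends on the questions you want to ask of the data.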

This is my "where to start".

PS: perldsc and Using Perl for Statistics: Data Processing and Statistical Computing (2004) as further reading suggestions.

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.


Replies are listed 'Best First'.

Re^2: Would Perl be a good choice for this?
by Speed_Freak (Beadle) on Oct 02, 2017 at 20:32 UTC

Thanks for the response! Sorry for not including any code, I haven't even gotten that far yet.

Maybe I can try to better explain what I am doing if you're interested... The markers are actually genetic sequences (1-138k, yes/no for presence), the items are samples, and the sub-groups are animals. I'm using an R program that uses a Gibbs sampler to look for the commonality between the known sub-groups and an unknown sample... The idea being that you can identify proportions of the known sub-groups in the unknown sample.

I currently have a large library of known samples that correspond to various sub-groups of animals. But the 138k markers are causing the R script to bog down substantially. (4+ days per unknown due to single core limitations.) So I want to choose a subset of the 138k markers to run. Ideally this subset would have markers that are unique to each sub-group, but the "uniqueness" could be variable. As in, total list output per sub-group, and % unique from other sub-groups. (By altering parameters, I would be able to request a list of 10k IDs from each sub-group that are 80% dissimilar from every other sub-group. Or a list of 5k that are 95% dissimilar...etc.) I definitely need to read up on statistics to figure out what I'm actually asking for!
