Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks!
In my code, after some lines, I end up with an array similar to this:
141326478 103194415 86004442 86004438 86004437 86004434 86004431 85280835 85280834 85280832 53250112 50137387 50137382 50137380 29223108 25694155 17916134

What I want to do is the following: create sub-arrays, so that elements that are within -1000 to +1000 of each other are put in the same sub-array. For example, 141326478 will be alone, since no other element of this array is within -1000 to +1000 of it, whereas 50137387, 50137382 and 50137380 will be grouped together.
I am really struggling with this for quite some time now and I have no idea on how to proceed. Any advice or snippet of code will be highly appreciated!

Replies are listed 'Best First'.
Re: Stuck with manipulating an array
by Corion (Patriarch) on Aug 28, 2017 at 13:15 UTC

    How about sorting the elements and binning them together until the distance between the first element in the bin and the next element is larger than 1000?

    If you want to optimize for a minimal number of bins or something like that, you will need to put more thought into that. There is lots of literature for "good" binning, but if you only care about the distance between elements, the naive approach should get you far.

    If you show us the relevant code you already have, we can maybe give you more concrete advice.

      So, I am basically here:
      open IN2, $first_tmp;
      while(<IN2>)
      {
          if($_=~/^(chr.*?)REC:(.*)/)
          {
              $respective_chrom=$1;
              $all_entries=$2;
              @split_entries=();
              @split_entries = split(/\#/, $all_entries);
              @split_sep_entries=();
              %collapsed_loci_HoA=();
              print ">".$respective_chrom."\n";
              foreach $sep_entry(@split_entries)
              {
                  @split_sep_entries = split(/\t/, $sep_entry);
                  $locus_to_use = $split_sep_entries[1];
                  $rest_entry=$split_sep_entries[0]."\t".$split_sep_entries[2]."\t".
                              $split_sep_entries[3]."\t".$split_sep_entries[4]."\t".
                              $split_sep_entries[5]."\t".$split_sep_entries[6]."\t".
                              $split_sep_entries[7];
                  push @{ $collapsed_loci_HoA{$locus_to_use} }, $rest_entry;
              }
              @array_of_loci = keys %collapsed_loci_HoA;
              for $b(sort { $b <=> $a } @array_of_loci)
              {
                  $count_arr++;
              }
              print "//\n";
          }
      }
      close IN2;

      and basically I am now getting my numbers sorted, as I posted above...
      What I cannot do is exactly this binning you propose. My idea is to take one element off the array each time and, if it is within the range, push it onto the sub-array of the element that started it, but I really can't see how to do that.
      I am new to Perl and I am literally stuck..

        My approach to binning would be simple. You look at the first element of the (sorted) array @split_entries and at the index of the potential candidate, and increase that index until the potential candidate is farther from the first element than your distance. All elements from the first element up to, but not including, that candidate then belong in one bin.

        An example, for a distance of 5:

        11 12 16 17 22 30

        First you look at the first position in your array (11). The next candidate is at the second position, and its value is 12. abs(12-11) < 5, so you increase the index of your candidate. The next candidate is at the third position, and its value is 16. abs(16-11) >= 5, so your first bin consists of the first and second entries in the array, 11 and 12.

        Now, you start the same thing over, as there are still elements in your array after removing 11 and 12 from it.

        You look at the first position in your array (16). The next candidate is at the second position, and its value is 17. abs(17-16) < 5, so you increase the index of your candidate. The next candidate is at the third position, and its value is 22. abs(22-16) >= 5, so your second bin consists of the first and second entries in the remaining array, 16 and 17.

        ... and so on.
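The walk-through above can be sketched in Perl roughly as follows. This is only a minimal illustration, not code from the thread: the helper name bin_by_first is made up for the example, and it assumes plain numeric input that it sorts in ascending order itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split a list of numbers into
# bins so that every element of a bin is less than $distance away from the
# first (smallest) element of that bin.
sub bin_by_first {
    my ($distance, @numbers) = @_;
    my @sorted = sort { $a <=> $b } @numbers;
    my @bins;
    while (@sorted) {
        my $first = $sorted[0];
        my $idx   = 1;    # index of the next candidate
        # Advance the candidate index while the candidate is still in range.
        $idx++ while $idx < @sorted && abs($sorted[$idx] - $first) < $distance;
        # splice removes the first $idx elements and returns them as one bin.
        push @bins, [ splice @sorted, 0, $idx ];
    }
    return @bins;
}

# The example from the post, with a distance of 5.
my @bins = bin_by_first(5, 11, 12, 16, 17, 22, 30);
print join(' ', @$_), "\n" for @bins;
```

For the example list this gives four bins: 11 12, then 16 17, then 22 and 30 each on their own.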

        Sorry, wrong paste:
        open IN2, $first_tmp;
        while(<IN2>)
        {
            if($_=~/^(chr.*?)REC:(.*)/)
            {
                $respective_chrom=$1;
                $all_entries=$2;
                @split_entries=();
                @split_entries = split(/\#/, $all_entries);
                @split_sep_entries=();
                %collapsed_loci_HoA=();
                print ">".$respective_chrom."\n";
                foreach $sep_entry(@split_entries)
                {
                    @split_sep_entries = split(/\t/, $sep_entry);
                    $locus_to_use = $split_sep_entries[1];
                    $rest_entry=$split_sep_entries[0]."\t".$split_sep_entries[2]."\t".
                                $split_sep_entries[3]."\t".$split_sep_entries[4]."\t".
                                $split_sep_entries[5]."\t".$split_sep_entries[6]."\t".
                                $split_sep_entries[7];
                    #print $locus_to_use."##".$rest_entry;
                    push @{ $collapsed_loci_HoA{$locus_to_use} }, $rest_entry;
                }
                $count_arr=0;
                @array_of_loci = keys %collapsed_loci_HoA;
                for $b(sort { $b <=> $a } @array_of_loci)
                {
                    print "$b"."\n";
                }
                print "//\n";
            }
        }
        close IN2;

        Now it is printing the numbers sorted.
Re: Stuck with manipulating an array
by tybalt89 (Monsignor) on Aug 28, 2017 at 13:36 UTC
    #!/usr/bin/perl

    # http://perlmonks.org/?node_id=1198153

    use strict;
    use warnings;
    use Data::Dumper;

    my $start;
    my @answer;

    for ( sort {$a <=> $b} map tr/\n//dr, <DATA> )
      {
      if( not defined $start or $_ > $start + 1000 )
        {
        push @answer, [ $_ ];
        $start = $_;
        }
      else
        {
        push @{ $answer[-1] }, $_;
        }
      }

    print Dumper \@answer;

    __DATA__
    141326478
    103194415
    86004442
    86004438
    86004437
    86004434
    86004431
    85280835
    85280834
    85280832
    53250112
    50137387
    50137382
    50137380
    29223108
    25694155
    17916134
      seems to do the trick, thank you so much!
      This is an array of arrays that you are creating, correct?

        Yes

Re: Stuck with manipulating an array
by BillKSmith (Monsignor) on Aug 28, 2017 at 14:34 UTC
    You have not responded to Corion's comment about "good" binning. Is any valid solution "good enough"? Do you have additional criteria, but do not know how to specify them? Consider how you would want to divide the list of integers (0..1001). (Note that there are over 1000 possible solutions using two bins. Far more if more bins are allowed.)
    Bill
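To make the (0..1001) question concrete, here is a small sketch comparing two possible rules on that list. It is an illustration under stated assumptions, not anything posted in the thread: the helper bin_sorted and its rule names 'first' and 'previous' are invented for the example. 'first' compares each number with the first element of the current bin (the rule tybalt89's snippet uses), while 'previous' compares it with the element just before it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread). $rule is either
#   'first'    - compare each number with the first element of the current bin
#   'previous' - compare each number with the element just before it
sub bin_sorted {
    my ($rule, $limit, @numbers) = @_;
    my (@bins, $ref);
    for my $n (sort { $a <=> $b } @numbers) {
        if (!defined $ref or $n > $ref + $limit) {
            push @bins, [$n];            # start a new bin
            $ref = $n;
        }
        else {
            push @{ $bins[-1] }, $n;     # extend the current bin
            $ref = $n if $rule eq 'previous';
        }
    }
    return @bins;
}

for my $rule (qw(first previous)) {
    my @bins = bin_sorted($rule, 1000, 0 .. 1001);
    printf "%-8s: %d bin(s), sizes: %s\n",
        $rule, scalar @bins, join(', ', map { scalar @$_ } @bins);
}
```

With 'first', the list splits into 0..1000 and a lone 1001. With 'previous', everything ends up in a single bin even though 0 and 1001 are more than 1000 apart. Which of these (if either) is "good" depends on criteria the original question does not state.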
      I think the answer/snippet provided by tybalt89 was exactly what I was after...
        Ok, I am basically stuck here:
        #!/usr/bin/perl
        use Data::Dumper;

        while(<DATA>)
        {
            $all_numbers=$_;
            chomp $all_numbers;
            @vector=();
            @vector = split(/\@/, $all_numbers);

            $start;
            @answer;

            for ( sort {$a <=> $b} @vector)
            {
                if( not defined $start or $_ > $start + 1000 )
                {
                    push @answer, [ $_ ];
                    $start = $_;
                }
                else
                {
                    push @{ $answer[-1] }, $_;
                }
            }
            for $i ( 0 .. $#answer )
            {
                print "$i\t [ @{$answer[$i]} ]\n";
            }
            print "//\n";
        }
        __DATA__
        141326478@103194415@50137382@86004442@86004438@86004434@85280835@17916134@85280834@86004437@85280832@53250112@50137387@50137380@29223108@25694155@86004431
        6901075@6901079@34073753@88911904@34073751@91346449@34073757

        If I only have 1 line of data, it works perfectly, but if I have these 2, it creates this:
        0 [ 17916134 ]
        1 [ 25694155 ]
        2 [ 29223108 ]
        3 [ 50137380 50137382 50137387 ]
        4 [ 53250112 ]
        5 [ 85280832 85280834 85280835 ]
        6 [ 86004431 86004434 86004437 86004438 86004442 ]
        7 [ 103194415 ]
        8 [ 141326478 ]
        //
        0 [ 17916134 ]
        1 [ 25694155 ]
        2 [ 29223108 ]
        3 [ 50137380 50137382 50137387 ]
        4 [ 53250112 ]
        5 [ 85280832 85280834 85280835 ]
        6 [ 86004431 86004434 86004437 86004438 86004442 ]
        7 [ 103194415 ]
        8 [ 141326478 6901075 6901079 34073751 34073753 34073757 88911904 91346449 ]
        //

        What am I doing wrong?
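A likely explanation, going only by the code as posted: the bare statements "$start;" and "@answer;" inside the while loop do not reset anything, so both variables keep their values from the first line. When the second line is processed, $start is still 141326478, no number on that line exceeds $start + 1000, and every number is appended to the last bin of the old @answer. One possible fix is to give each line fresh, lexically scoped variables. The sketch below does that; the helper name bins_for_line and the @lines array standing in for the DATA section are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): bin the '@'-separated numbers
# of one input line. $start and @answer are declared with "my" inside the
# sub, so every call starts from a clean state.
sub bins_for_line {
    my ($line) = @_;
    chomp $line;
    my ($start, @answer);
    for my $n (sort { $a <=> $b } split /\@/, $line) {
        if (not defined $start or $n > $start + 1000) {
            push @answer, [$n];          # start a new bin
            $start = $n;
        }
        else {
            push @{ $answer[-1] }, $n;   # extend the current bin
        }
    }
    return @answer;
}

# Stand-in for the DATA section of the post.
my @lines = (
    '141326478@103194415@50137382@86004442@86004438@86004434@85280835@17916134@85280834@86004437@85280832@53250112@50137387@50137380@29223108@25694155@86004431',
    '6901075@6901079@34073753@88911904@34073751@91346449@34073757',
);

for my $line (@lines) {
    my @answer = bins_for_line($line);
    for my $i (0 .. $#answer) {
        print "$i\t [ @{ $answer[$i] } ]\n";
    }
    print "//\n";
}
```

Adding "use strict; use warnings;" to the original script would also have flagged the two bare statements ("Useless use of a variable in void context") and forced the variables to be declared.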
Re: Stuck with manipulating an array
by salva (Canon) on Aug 28, 2017 at 13:16 UTC
    well, forget about the computer, how would you solve the problem using your brain alone?