Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Venerable monks

I have some objects called $slice, each of which basically represents a very long character string (e.g. millions of characters long). A code library I am using takes a $slice object and breaks it down into a set of subslices of a particular length. It returns an array of slice objects, where each slice now represents a piece of the original string. For example, it can take a slice of 100 characters and break it down into 10 strings of 10 characters each: slice 1 represents characters 1 to 10, slice 2 represents characters 11 to 20, and so on. (Actually, the function works out the slice length to use based on user input.)

I thought the code would return the slices/substrings in order when I looked at it, e.g. 1-10, 11-20, 21-30, but when I use it the subslices are not in order. I am a Perl beginner and can't modify the code to get it to do what I want, so I was hoping you might be able to help. I expected them to be in order because the push function puts each successive slice onto the end of the array.

  Arg [1]    : ref to list of slices
  Arg [2]    : int maxlength of sub slices
  Arg [3]    : int overlap length (optional)
  Example    : my $sub_slices = split_Slices($slices,$maxlen,$overlap)
  Description: splits a slice into smaller slices
  Returntype : ref to list of slices
  Exceptions : maxlen <1 or overlap < 0

sub split_Slices {
  my ($slice_big,$max_length,$overlap)=@_;

  if(!defined($max_length) or $max_length < 1){
    throw("maxlength needs to be set and > 0");
  }

  if(!defined($overlap)){
    $overlap = 0;
  }
  elsif($overlap < 0){
    throw("negative overlaps not allowed");
  }

  my @out=();

  foreach my $slice (@$slice_big){

    my $start = $slice->start;
    my $end;
    my $multiple;
    my $number;

    my $length = $slice->length;

    if($max_length && ($length > $overlap)) {
      #No seq region may be longer than max_length but we want to make
      #them all similar size so that the last one isn't much shorter.
      #Divide the seq_region into the largest equal pieces that are shorter
      #than max_length

      #calculate number of slices to create
      $number = ($length-$overlap) / ($max_length-$overlap);
      $number = ceil($number); #round up to int

      #calculate length of created slices
      $multiple = $length / $number;
      $multiple = floor($multiple); #round down to int } else {
      #just one slice of the whole seq_region
      $number = 1;
      $multiple = $length;
    }

    my $i;
    for(my $i=0; $i < $number; $i++) {
      $end = $start + $multiple + $overlap;

      #any remainder gets added to the last slice of the seq_region
      $end = $slice->end if($i == $number-1);
      push @out, Bio::EnsEMBL::Slice->new
        (-START             => $start,
         -END               => $end,
         -STRAND            => 1,
         -SEQ_REGION_NAME   => $slice->seq_region_name,
         -SEQ_REGION_LENGTH => $slice->seq_region_length,
         -COORD_SYSTEM      => $slice->coord_system,
         -ADAPTOR           => $slice->adaptor);
      $start += $multiple + 1;
    }
  }

  return \@out;
}

1;

thanks very much

Replies are listed 'Best First'.
Re: sorting problem
by oko1 (Deacon) on Dec 28, 2010 at 16:38 UTC

    The first thing I noticed is an error in your script that would stop it from working correctly (albeit nothing to do with the sort order problem):

    $multiple = floor($multiple); #round down to int } else {

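    That '} else {' does nothing, since it comes after a '#' and is therefore part of the comment. As a result, the two assignments meant for the else branch run inside the if block and overwrite the values you just calculated, so all of your "$number"s and "$multiple"s end up with their fallback values, which is probably not what you want. Just move that clause down to the next line; i.e., it should read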

    ...
        $multiple = $length / $number;
        $multiple = floor($multiple);
    } else {
        $number = 1;
        $multiple = $length;
    }
    ...
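
    To see the effect of a '} else {' hidden behind a '#', here is a minimal stand-alone sketch. It is not the library code: the sub name broken_layout and the hard-coded values 4 and 25 are made up purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical demo, not the original library code.
# Returns the ($number, $multiple) pair that the broken layout produces.
sub broken_layout {
    my ($length) = @_;
    my ($number, $multiple);
    if ($length > 0) {
        $number   = 4;
        $multiple = 25;    # round down to int } else {
        # Perl never sees the "} else {" above, so these two lines are
        # still inside the "if" block and overwrite the values just set.
        $number   = 1;
        $multiple = $length;
    }
    return ($number, $multiple);
}

my ($number, $multiple) = broken_layout(100);
print "number=$number multiple=$multiple\n";    # prints "number=1 multiple=100"
```

    With the '} else {' on its own line, the same call would return 4 and 25 instead.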

    As for the rest of it, I just faked up a bit of a script to test this and fed it a "slice" of the sort you're talking about, and it came out sorted. Example:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use POSIX qw/ceil floor/;

    my $len = shift or die "Need a MAX_LENGTH argument\n";

    sub split_Slices {
        my ($slice_big,$max_length,$overlap)=@_;
        my @out;

        for my $slice (@$slice_big){
            my ($start, $length, $end, $multiple, $number) =
                ($slice->{begin}, $slice->{length});

            if($max_length && ($length > $overlap)) {
                $number = ceil(($length-$overlap) / ($max_length-$overlap));
                $multiple = floor($length / $number);
            }
            else {
                $number = 1;
                $multiple = $length;
            }

            for(my $i=0; $i < $number; $i++) {
                $end = $start + $multiple + $overlap;
                push @out, { begin => $start, end => $end };
                $start += $multiple + 1;
            }
        }
        return \@out;
    }

    my @stuff;
    push @stuff, { begin => $_ * 10, length => 10 } for 0..9;

    my $ret = split_Slices(\@stuff, $len, 0);

    use Data::Dumper;
    print Dumper($ret);

    Running it with a couple of different arguments produces the following:

    ./slicer 10
    $VAR1 = [
      { 'begin' => 0, 'end' => 10 },
      { 'begin' => 10, 'end' => 20 },
      { 'begin' => 20, 'end' => 30 },
      { 'begin' => 30, 'end' => 40 },
      { 'begin' => 40, 'end' => 50 },
      { 'begin' => 50, 'end' => 60 },
      { 'begin' => 60, 'end' => 70 },
      { 'begin' => 70, 'end' => 80 },
      { 'begin' => 80, 'end' => 90 },
      { 'begin' => 90, 'end' => 100 }
    ];

    ./slicer 5
    $VAR1 = [
      { 'begin' => 0, 'end' => 5 },
      { 'begin' => 6, 'end' => 11 },
      { 'begin' => 10, 'end' => 15 },
      { 'begin' => 16, 'end' => 21 },
      { 'begin' => 20, 'end' => 25 },
      { 'begin' => 26, 'end' => 31 },
      { 'begin' => 30, 'end' => 35 },
      { 'begin' => 36, 'end' => 41 },
      { 'begin' => 40, 'end' => 45 },
      { 'begin' => 46, 'end' => 51 },
      { 'begin' => 50, 'end' => 55 },
      { 'begin' => 56, 'end' => 61 },
      { 'begin' => 60, 'end' => 65 },
      { 'begin' => 66, 'end' => 71 },
      { 'begin' => 70, 'end' => 75 },
      { 'begin' => 76, 'end' => 81 },
      { 'begin' => 80, 'end' => 85 },
      { 'begin' => 86, 'end' => 91 },
      { 'begin' => 90, 'end' => 95 },
      { 'begin' => 96, 'end' => 101 }
    ];
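
    The piece counts in those two runs follow directly from the two formulas in split_Slices. The sketch below pulls them out into a helper so the arithmetic can be checked on its own; the name piece_count_and_size is made up for this example, and the third test case (100, 30, 0) is an extra illustration that does not come from the runs above.

```perl
use strict;
use warnings;
use POSIX qw(ceil floor);

# Hypothetical helper: the same two formulas as in split_Slices.
# Assumes $max_length > $overlap, otherwise the division fails.
sub piece_count_and_size {
    my ($length, $max_length, $overlap) = @_;
    my $number   = ceil(($length - $overlap) / ($max_length - $overlap));
    my $multiple = floor($length / $number);
    return ($number, $multiple);
}

# Each case is [length, max_length, overlap].
for my $case ([10, 10, 0], [10, 5, 0], [100, 30, 0]) {
    my ($number, $multiple) = piece_count_and_size(@$case);
    printf "length=%d max=%d overlap=%d -> %d piece(s), step %d\n",
        @$case, $number, $multiple;
}
```

    For a record of length 10 this gives 1 piece with step 10 when the maximum is 10, and 2 pieces with step 5 when the maximum is 5, which matches the 10 and 20 entries printed by the two runs.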

    So it seems that the sorting part of it works fine (although I'd take a closer look at that overlap routine). Perhaps you should examine the data being fed to it.
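
    One way to examine the input is to check whether the list is already ordered by start position, and to sort it before splitting if it is not. This is only a sketch under assumptions: it uses plain hash refs with a 'begin' key, like my test data, and the helper name is_sorted_by_begin is invented here.

```perl
use strict;
use warnings;

# Hypothetical input records (hash refs with a 'begin' key),
# deliberately out of order.
my @input = map { +{ begin => $_, length => 10 } } (20, 0, 10);

# True when every record starts at or after the one before it.
sub is_sorted_by_begin {
    my ($records) = @_;
    for my $i (1 .. $#$records) {
        return 0 if $records->[$i]{begin} < $records->[$i - 1]{begin};
    }
    return 1;
}

print "input sorted? ", (is_sorted_by_begin(\@input) ? "yes" : "no"), "\n";

# Sort numerically by start position before handing the list on.
my @sorted = sort { $a->{begin} <=> $b->{begin} } @input;
print "after sort: ", join(", ", map { $_->{begin} } @sorted), "\n";
```

    With real slice objects the comparison would presumably use $a->start <=> $b->start instead, and if the list mixes several seq regions you would likely want to compare seq_region_name first.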

    -- 
    Education is not the filling of a pail, but the lighting of a fire.
     -- W. B. Yeats
Re: sorting problem
by Anonymous Monk on Dec 28, 2010 at 15:31 UTC
    I think I am correct that the subslices are returned in order, so the problem must lie elsewhere?