comment on

Greetings exalted monks. I've had a multi-year break from coding and seek help in modifying a script I just wrote that processes an input file that has the following format: alternating lines of IDs and sequences

>ID1
atgagcagctag
>ID2
tggacgagctgaca
.
.
>IDn
tgactagacggacatac
[download]

The goal here is to perform the following steps:

1. concatenate all these sequences,

2. shuffle substrings of it, each of of length l, as non-overlapping windows, and user pre-determined (hard coded to be 10^6 here)

3. concatenate the shuffled sequences,

4. split them into the same length distribution of sequences as in the input

5. wrote original IDs but shuffled sequences to the output file, in same format as in the input

For the example input above, ONE example output of randomly shuffled sequences would be:

>ID1
actggctagaag
>ID2
cggaccagggacta
.
.
>IDn
ctgaaaaaggagccttt
[download]

The code I've written is quite infantile but works OK for files ~ 100-150<B in size. However the problem I see is that for input files that are even modestly sized ~ 500MB, even a run on my university computer cluster is running out of ~ 40GB RAM.

So I seek tips to modify the logic and/or syntax so it will execute to completion even for an input file of size ~ 2GB (largest one I need to process), preferably even on a local machine config (mine has 8GB RMA)

Thanks in advance!

use strict;
use warnings;
use List::Util 'shuffle'; # Idea from http://www.perlmonks.org/?node_i
+d=199901

######################################################################
+####################

my $start_time =time;

my $input = shift @ARGV;
my $destination = shift @ARGV;

open(IN, '<', $input) or die "Can't read multifasta input genomic DNA 
+file $input : $!\n";

my $window = 1000000; # hard coded for shuffle window to be 1MB i.e 10
+^6
print "The length of the window shuffled is 1MB (i.e. 10^6 bp), it is 
+hard coded, you may change it in line 27 of this script. Thank you! \
+n";

my (@IDs, @lengths, @IDsLens, @output);

my $concat_seq="";
my $total_length= 0;

######################################################################
+####################

while (<IN>) {
    chomp;
    if ($_ =~ m/\>/)    {
    my $ID1 = my $ID2 = $_;
    push @IDs, $ID1;
    push @IDsLens, $ID2;
    }
    elsif ($_ !~ m/\>/)    {
    my $len1 = my $len2 = length($_);
    push @lengths, $len1;
    $total_length = $total_length + $len1;
    push @IDsLens, $len2;
    $concat_seq = $concat_seq.$_; # concatenating the entire genome in
+to one long sequence before breaking it for shuffling
    }
}

my %IDLen_hash = @IDsLens;

######################################################################
+####################

my $i = 0;
my $cat_shuf_full_seq="";

        for (my $i = 1; $i <= $total_length; $i=$i+$window ) { # perfo
+rm this operation below every 1MB, see line 21 above
            my $s = substr ($concat_seq, $i - 1, $window);
            my $rev = reverse $s;
            my @temp_seq_array;
            if ($i % 2 == 0)    {@temp_seq_array = split //, $rev;} # 
+throw in some reverse sequence alternatively to shuffle it more rando
+mly
            elsif ($i % 2 != 0)    {@temp_seq_array = split //, $s;}
           my @rand_seq_array = shuffle @temp_seq_array; # using the L
+ist::Util module
           my $rand_shuffled_substr = join ('', @rand_seq_array,);
           $cat_shuf_full_seq = $cat_shuf_full_seq.$rand_shuffled_subs
+tr; # concatenates the shuffled DNA seq to the 3' end of the previous
+ 1MB fragment
    }
    
    my @shuffle_concat_array = split("", $cat_shuf_full_seq);
    
######################################################################
+####################

foreach my $ID (@IDs)    {

my $contig_len = $IDLen_hash{$ID};
my @shuffle_concat_splice_seq = splice @shuffle_concat_array, 0, $cont
+ig_len; 
# this will progressively reduce the length of @shuffle_concat_splice_
+seq
# until the final splice operation will be for the length of the final
+ contig in original sequence
# which should also be the same as the length of the remaining array a
+ssigned to this array variable
my $cat_shuf_splice_final_subseq = join('', @shuffle_concat_splice_seq
+);
print $ID, "\n";
push @output, $ID, "\n", $cat_shuf_splice_final_subseq, "\n";
}

######################################################################
+####################

#my @input_name_split = split(".fa.shIDscleaned-up",$input);
#my $destination = $input_name_split[0]."_cat_shuf1MB_frag.fasta";
open(OUT, '>', $destination) or die "Can't write to file $destination:
+ $!\n";
print OUT @output;
close OUT;
my $end_time = time;
my $duration = ($end_time - $start_time)/60;
my $rounded = sprintf("%.1f", $duration);
print "Finished processing input, written output to ", $destination, "
+ in ", $rounded, " minutes \n";

######################################################################
+####################
[download]

In reply to Reduce RAM required by onlyIDleft

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.