onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:
Greetings exalted monks. I've had a multi-year break from coding and seek help modifying a script I just wrote. It processes an input file with the following format: alternating lines of IDs and sequences:
```
>ID1
atgagcagctag
>ID2
tggacgagctgaca
.
.
>IDn
tgactagacggacatac
```
The goal here is to perform the following steps:
1. concatenate all these sequences,
2. shuffle non-overlapping windows of the concatenated sequence, each of a user-determined length l (hard-coded to 10^6 here),
3. concatenate the shuffled windows,
4. split the result into the same length distribution of sequences as in the input,
5. write the original IDs, but with shuffled sequences, to the output file, in the same format as the input.
For the example input above, ONE example output of randomly shuffled sequences would be:
```
>ID1
actggctagaag
>ID2
cggaccagggacta
.
.
>IDn
ctgaaaaaggagccttt
```
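The five steps above can be condensed into a small in-memory sketch (a hypothetical helper, not the posted script; `shuffle_fasta` and its arguments are names invented for illustration):

```perl
use strict;
use warnings;
use List::Util 'shuffle';

# Hypothetical helper mirroring the five steps for a small in-memory input.
# $ids: arrayref of headers in input order; $seqs: hashref of header => sequence;
# $window: shuffle window length.
sub shuffle_fasta {
    my ( $ids, $seqs, $window ) = @_;
    my @lengths = map { length $seqs->{$_} } @$ids;
    my $concat  = join '', map { $seqs->{$_} } @$ids;     # step 1: concatenate
    my $shuffled = '';
    for ( my $pos = 0; $pos < length $concat; $pos += $window ) {
        my $chunk = substr $concat, $pos, $window;        # step 2: non-overlapping window
        $shuffled .= join '', shuffle split //, $chunk;   # steps 2-3: shuffle, re-concatenate
    }
    my ( %out, $off ) = ( (), 0 );
    for my $i ( 0 .. $#$ids ) {                           # step 4: original length distribution
        $out{ $ids->[$i] } = substr $shuffled, $off, $lengths[$i];
        $off += $lengths[$i];
    }
    return \%out;                                         # step 5: caller writes ">ID\nseq\n"
}
```

Because the windows are non-overlapping and each window is only permuted, the output is guaranteed to have the same total length and the same base composition as the input.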
The code I've written is quite infantile but works OK for files ~100-150 MB in size. However, for even modestly sized input files (~500 MB), a run on my university computer cluster runs out of ~40 GB of RAM.
So I seek tips on modifying the logic and/or syntax so it will run to completion even for an input file of ~2 GB (the largest one I need to process), preferably even on a local machine (mine has 8 GB of RAM).
Thanks in advance!
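A likely source of the blow-up is `split //`, which inflates every base into a full Perl scalar (tens of bytes each). One possible alternative, sketched below under that assumption, is a Fisher-Yates shuffle performed in place on the string itself using lvalue `substr`, so no per-character array is ever built (`shuffle_window_inplace` is a hypothetical name, not from the thread):

```perl
use strict;
use warnings;

# Hypothetical sketch: shuffle a window of a string in place with lvalue substr,
# avoiding split // (which turns every base into a full Perl scalar).
sub shuffle_window_inplace {
    my ( $seq_ref, $start, $len ) = @_;
    my $end = $start + $len - 1;
    $end = length($$seq_ref) - 1 if $end > length($$seq_ref) - 1;
    for ( my $i = $end; $i > $start; $i-- ) {        # Fisher-Yates over the window
        my $j = $start + int rand( $i - $start + 1 );
        next if $i == $j;
        my $tmp = substr $$seq_ref, $i, 1;
        substr( $$seq_ref, $i, 1 ) = substr $$seq_ref, $j, 1;
        substr( $$seq_ref, $j, 1 ) = $tmp;
    }
}
```

Calling this once per 1 MB window over the concatenated sequence should keep peak memory close to the size of the sequence string itself.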
```perl
use strict;
use warnings;
use List::Util 'shuffle';    # idea from http://www.perlmonks.org/?node_id=199901
##########################################################################
my $start_time  = time;
my $input       = shift @ARGV;
my $destination = shift @ARGV;
open( IN, '<', $input ) or die "Can't read multifasta input genomic DNA file $input : $!\n";
my $window = 1000000;        # hard coded for shuffle window to be 1MB i.e. 10^6
print "The length of the window shuffled is 1MB (i.e. 10^6 bp), it is hard coded, you may change it in line 27 of this script. Thank you!\n";
my ( @IDs, @lengths, @IDsLens, @output );
my $concat_seq   = "";
my $total_length = 0;
##########################################################################
while (<IN>) {
    chomp;
    if ( $_ =~ m/\>/ ) {
        my $ID1 = my $ID2 = $_;
        push @IDs,     $ID1;
        push @IDsLens, $ID2;
    }
    elsif ( $_ !~ m/\>/ ) {
        my $len1 = my $len2 = length($_);
        push @lengths, $len1;
        $total_length = $total_length + $len1;
        push @IDsLens, $len2;
        # concatenating the entire genome into one long sequence
        # before breaking it up for shuffling
        $concat_seq = $concat_seq . $_;
    }
}
close IN;
my %IDLen_hash = @IDsLens;
##########################################################################
my $cat_shuf_full_seq = "";
# perform the operation below every 1MB, see $window above
for ( my $i = 1; $i <= $total_length; $i = $i + $window ) {
    my $s   = substr( $concat_seq, $i - 1, $window );
    my $rev = reverse $s;
    my @temp_seq_array;
    # throw in some reversed sequence alternately to shuffle it more randomly
    if    ( $i % 2 == 0 ) { @temp_seq_array = split //, $rev; }
    elsif ( $i % 2 != 0 ) { @temp_seq_array = split //, $s; }
    my @rand_seq_array       = shuffle @temp_seq_array;    # List::Util
    my $rand_shuffled_substr = join( '', @rand_seq_array );
    # concatenates the shuffled DNA seq to the 3' end of the previous 1MB fragment
    $cat_shuf_full_seq = $cat_shuf_full_seq . $rand_shuffled_substr;
}
my @shuffle_concat_array = split( "", $cat_shuf_full_seq );
##########################################################################
foreach my $ID (@IDs) {
    my $contig_len = $IDLen_hash{$ID};
    # this will progressively shorten @shuffle_concat_array until the final
    # splice operation is for the length of the final contig in the original
    # sequence, which should also equal the length of the remaining array
    my @shuffle_concat_splice_seq = splice @shuffle_concat_array, 0, $contig_len;
    my $cat_shuf_splice_final_subseq = join( '', @shuffle_concat_splice_seq );
    print $ID, "\n";
    push @output, $ID, "\n", $cat_shuf_splice_final_subseq, "\n";
}
##########################################################################
#my @input_name_split = split( ".fa.shIDscleaned-up", $input );
#my $destination = $input_name_split[0] . "_cat_shuf1MB_frag.fasta";
open( OUT, '>', $destination ) or die "Can't write to file $destination: $!\n";
print OUT @output;
close OUT;
my $end_time = time;
my $duration = ( $end_time - $start_time ) / 60;
my $rounded  = sprintf( "%.1f", $duration );
print "Finished processing input, written output to ", $destination, " in ", $rounded, " minutes\n";
##########################################################################
```
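The output stage of the posted script splits the whole shuffled genome into a character array and repeatedly splices it, which multiplies memory use. A possible alternative, sketched below, writes each record directly by offset with `substr` (`write_by_lengths` and its arguments are hypothetical names; `@ids` and `@lengths` are assumed to hold the headers and sequence lengths in input order):

```perl
use strict;
use warnings;

# Hypothetical sketch: emit shuffled output by offset with substr instead of
# splitting the shuffled string into a char array and splicing it apart.
sub write_by_lengths {
    my ( $fh, $shuffled_ref, $ids, $lengths ) = @_;
    my $off = 0;
    for my $i ( 0 .. $#$ids ) {
        # one ">ID\nsequence\n" record per original contig, taken by offset
        print {$fh} $ids->[$i], "\n", substr( $$shuffled_ref, $off, $lengths->[$i] ), "\n";
        $off += $lengths->[$i];
    }
}
```

Since nothing is copied except the one record being printed, peak memory stays near the size of the shuffled string rather than growing with a per-character array.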
Replies are listed 'Best First'.

Re: Reduce RAM required
by Corion (Patriarch) on Jan 08, 2019 at 10:44 UTC
  by onlyIDleft (Scribe) on Jan 09, 2019 at 16:14 UTC
Re: Reduce RAM required
by hippo (Archbishop) on Jan 08, 2019 at 10:36 UTC
  by onlyIDleft (Scribe) on Jan 09, 2019 at 16:15 UTC
Re: Reduce RAM required
by Eily (Monsignor) on Jan 08, 2019 at 11:03 UTC
  by onlyIDleft (Scribe) on Jan 09, 2019 at 16:27 UTC
Re: Reduce RAM required
by bliako (Abbot) on Jan 08, 2019 at 11:26 UTC
  by onlyIDleft (Scribe) on Jan 09, 2019 at 16:22 UTC
    by bliako (Abbot) on Jan 10, 2019 at 16:41 UTC
      by onlyIDleft (Scribe) on Jan 10, 2019 at 23:16 UTC
  by vr (Curate) on Jan 11, 2019 at 10:51 UTC
  by etj (Priest) on May 02, 2022 at 18:41 UTC
Re: Reduce RAM required
by hdb (Monsignor) on Jan 08, 2019 at 10:35 UTC
  by onlyIDleft (Scribe) on Jan 09, 2019 at 16:10 UTC
Re: Reduce RAM required
by tybalt89 (Monsignor) on Jan 09, 2019 at 00:58 UTC
  by onlyIDleft (Scribe) on Jan 09, 2019 at 16:57 UTC
    by tybalt89 (Monsignor) on Jan 09, 2019 at 17:50 UTC
      by onlyIDleft (Scribe) on Jan 09, 2019 at 19:05 UTC
        by tybalt89 (Monsignor) on Jan 09, 2019 at 19:16 UTC
Re: Reduce RAM required
by swl (Prior) on Jan 08, 2019 at 22:53 UTC
  by onlyIDleft (Scribe) on Jan 09, 2019 at 16:01 UTC
Re: Reduce RAM required
by jimpudar (Pilgrim) on Jan 08, 2019 at 22:22 UTC
  by onlyIDleft (Scribe) on Jan 09, 2019 at 16:04 UTC