Istirion has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perlmonks,

I have a programming problem which I can't solve on my own and therefor Id like to ask for your help.

Earlier in the programm I read a complete inputfile via <STDIN>, process and split it in pieces and get something like a really big array of strings. Each string is a part of DNA, so it looks like "ATCGCAT..." (very long!). I can join the complete array into one string for further analysis of the DNA. But I also would like to do this:

1) Check every item of the array, starting with [0] if it passes an easy test (here: I just want to see, if the string of bases can be divided by 3 and therefor has the correct number for further analysis and transformation into aminoacids)

2) Join all strings in the array which passed the test into one big string (>6 mio. characters), letting out those which didn't pass (wrong number of bases).

3) Send some info to a logfile, which says something like this "Found x blocks (x = number of elemens in the array = $#array), joined y of them and left out z." The leftout parts should follow, like

Not used blocks# 1200 1500 5000 ...

I think it could be really easily solved, but right now I don't see the solution... :-(

Thank you very much for your help!
  • Comment on Analyse an array and join only some elements of it

Replies are listed 'Best First'.
Re: Analyse an array and join only some elements of it
by moritz (Cardinal) on Jun 15, 2012 at 13:11 UTC
    my @dna_chunks = ...; sub your_test { my $string = shift; # do your test here, and return 1 if it passes } my @used = grep your_test($_), @dna_chunks; my $long_string = join '', @used; print "Used ", scalar(@used), " chunks, ",@dna_chunks - @used, " disca +red";

    See grep for details.

      The combination of join and grep sounds like it will do what you want. (map and sort and some of their friends from List::Util and List::MoreUtils are also often useful, but don't sound needed in your case.)

      The following example uses join and grep to filter the numbers from 1 to 20 for primes, then join the primes together into a string separated by semicolons.

      sub is_prime { my $num = shift; return if $num == 1; # 1 is not prime die "usage: is_prime(NATURAL NUMBER)" unless $num =~ /^[1-9][0-9]*$/; for my $div (2 .. sqrt $num) { return if $num % $div == 0; } return 1; } my @numbers = 1 .. 20; print join ";", grep { is_prime($_) } @numbers;
      perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
Re: Analyse an array and join only some elements of it
by jwkrahn (Abbot) on Jun 15, 2012 at 14:30 UTC

    It sounds like you need something like this:

    my ( $fasta, $valid, $invalid ); while ( <STDIN> ) { tr/ATCGatcg//cd; next unless /[ATCGatcg]/; if ( length % 3 ) { $invalid++; next; } $valid++; $fasta .= $_; } print LOGFILE "Found ", $valid + $invalid, " blocks, joined $valid of +them and left out $invalid.\n";
Re: Analyse an array and join only some elements of it
by AnomalousMonk (Archbishop) on Jun 15, 2012 at 17:56 UTC
    >perl -wMstrict -le "use List::MoreUtils qw(part); ;; my @strings = qw(X XY AAA WXYZ VWXYZ BBBCCC TUVWXYZ); ;; use constant { X3 => 0, NX3 => 1, }; my @indices = part { length($strings[$_]) % 3 ? NX3 : X3 } 0 .. $#strings; ;; print qq{indices of strings whose lengths are...}; print qq{ even multiples of 3: @{$indices[X3]}}; print qq{ non-even multiples of 3: @{$indices[NX3]}}; ;; my $aminos = join '', @strings[ @{$indices[X3]} ]; print qq{amino acids: '$aminos'}; " indices of strings whose lengths are... even multiples of 3: 2 5 non-even multiples of 3: 0 1 3 4 6 amino acids: 'AAABBBCCC'

    From the OP:

    ... Found x blocks (x = number of elemens in the array = $#array) ...

    $#array is the highest index of an element in the array. The number of elements would be given by  scalar(@array) or, in general, by evaluating  @array in scalar context, e.g.,  0+@array or  @array == 3 or some such.

Re: Analyse an array and join only some elements of it
by Cristoforo (Curate) on Jun 16, 2012 at 15:54 UTC
    From your description, it sounds like you are parsing a 'fasta' format file line by line and, if it is evenly divisible by 3, join that to a string of sequences passing that test. I'm not sure, but if a single line of the sequence is divisible by 3, the entire sequence may not. The method below reads in an entire sequence at a time, and checks for mod 3 == 0, (evenly divisible by 3).

    #!/usr/bin/perl use strict; use warnings; use Bio::SeqIO; my $in = Bio::SeqIO->new( -file => "input1.txt" , -format => 'fasta'); my ($string, $valid, $invalid) = ('', 0, 0); while ( my $seq = $in->next_seq() ) { if ($seq->length % 3 == 0) { $string .= $seq->seq; ++$valid; } else { ++$invalid; } } open LOGFILE, ">", 'somefile' or die "Unable to open 'somefile' for wr +iting. $!"; print LOGFILE "Found ", $valid + $invalid, " blocks, joined $valid of +them and left out $invalid.\n"; close LOGFILE or die $!;

    Chris

      Hi again! Thank you very much for all of your help and really good ideas. I will try all of them and see which works best in my programm. Christoforo, you're completely right: the input file is an annotated fasta file, which I have to read line by line and analyse for Codon Usage and so on. I didn't use Bioperl for creating a seqobject but instead did all the work by myself. :-) Your wellthought objection can be clearly answered, I think. If all parts are divisible by three the entire sequenz should be also divisible by three, because it consists of parts which are divisible by three. Thanks again, Markus