Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I am getting myself confused with a very simple problem. I have a file containing gene sequences in fasta format (@file).
e.g.
>gb|203-3411
ctacttactgtgactactgtgactgactgtgcatgacg
catcatgcatgtgacttgactgactgatgctgactgct
>gb|4670-5490
ctactgtgactgacttgactgactgtgactgtgctgac
catgcagtactacttactgagtacgtctgtgcgt
I also have an array containing numbers to match some of the gene records (@numbers).
e.g.
456-3210
4670-5490
I want to simply extract all genes whose positions are in the numbers array. So the desired output would be this:
>gb|4670-5490
ctactgtgactgacttgactgactgtgactgtgctgac
catgcagtactacttactgagtacgtctgtgcgt
Please could someone explain how this can be done? Many thanks, AM

Replies are listed 'Best First'.
Re: Iterating through files and arrays simultaneously
by dragonchild (Archbishop) on Oct 06, 2003 at 16:39 UTC
    First off, you want to use a hash to associate a given identifier (e.g. 4670-5490) with a gene sequence. You will want to do something like:
    use strict;
    use warnings;

    # Map each wanted identifier to an (initially empty) list of sequence lines
    my %sequences = map { $_ => [] } (
        '456-3210',
        '4670-5490',
        # etc ...
    );

    my $current;
    while (<FILE_WITH_STUFF>) {
        chomp;

        # If this line is a sequence identifier like ">gb|4670-5490" ...
        if (/^>gb\|(\d+-\d+)$/) {
            # Clear the sequence we're parsing
            $current = undef;

            # Skip this one unless it's a sequence we're looking for
            next unless exists $sequences{$1};

            # Mark it as the one we're reading into
            $current = $1;
            next;
        }
        next unless $current;

        push @{ $sequences{$current} }, $_;
    }

    foreach my $key (sort keys %sequences) {
        print "$key:\n";
        print "\t$_\n" for @{ $sequences{$key} };
    }

    Try that and see if it does what you need. (NOTE: I did NOT test this code. It might not even compile.)

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Iterating through files and arrays simultaneously
by stajich (Chaplain) on Oct 06, 2003 at 18:19 UTC
    I assume those numbers are unique identifiers, NOT ranges? Bio::SeqIO from Bioperl is going to be helpful if you are ever going to have sequences in different file formats.
    use Bio::SeqIO;

    my $in = Bio::SeqIO->new(-file => 'filename', -format => 'fasta');
    my %seqs;
    my @idlist = qw(456-3210 4670-5490);

    while ( my $seq = $in->next_seq ) {
        my $id = $seq->display_id;
        $id =~ s/gb\|//;
        $seqs{$id} = $seq;
    }

    my @seqlist = grep { defined } map { $seqs{$_} } @idlist;
    If there are a LOT of sequences and you will be doing this repeatedly for different ID sets, you can do this more efficiently with Bio::Index::Fasta.
    use Bio::Index::Fasta;

    my $idx = Bio::Index::Fasta->new(-filename   => 'seqs.idx',
                                     -write_flag => 1);
    $idx->id_parser(\&myidparser);
    $idx->make_index($seqfile);

    my @idlist = qw(456-3210 4670-5490);
    my @seqlist;
    for my $id ( @idlist ) {
        if ( my $seq = $idx->get_Seq_by_acc($id) ) {
            push @seqlist, $seq;
        }
    }

    # Define your own ID parser if you want to strip out
    # the gb| part of the id.
    # Alternatively, don't do this and make sure the IDs
    # you input exactly match the IDs of the sequences.
    sub myidparser {
        if    ( $_[0] =~ /^>\s*gb\|(\S+)/ ) { return $1 }
        elsif ( $_[0] =~ /^>\s*(\S+)/ )     { return $1 }
        else                                { return }
    }
Re: Iterating through files and arrays simultaneously
by broquaint (Abbot) on Oct 06, 2003 at 16:50 UTC
    If you're happy to slurp the file
    my @data = do {
        open(my $fh, '<', 'yourfile') or die("ack: $!");
        local $/ = '>gb|';
        scalar <$fh>;    # first read will be useless
        <$fh>;
    };

    my @nums = qw( 456-3210 4670-5490 );

    for my $n (@nums) {
        print map  { '>gb|' . substr($_, 0, -4) }
              grep { /^$n/ } @data;
    }
    So that will create an array consisting of chunks of the file delimited by >gb|, and then print any chunk that matches any of the numbers in @nums.
    HTH

    _________
    broquaint

Re: Iterating through files and arrays simultaneously
by flounder99 (Friar) on Oct 06, 2003 at 17:51 UTC
    You could always join your @numbers file into one big regex. That way you don't have to slurp @file into memory. I have a feeling it may be huge.
    use strict;

    open NUMBERFILE, "<\@numbers" or die $!;
    my $regex = join "|",
        map { chomp; "^\\>gb\\|" . quotemeta($_) . "\$" } <NUMBERFILE>;
    close NUMBERFILE;

    open DATAFILE, "<\@file" or die $!;
    my $flag = 0;
    while (<DATAFILE>) {
        if (substr($_, 0, 1) eq ">") {
            $flag = /$regex/;
        }
        print if $flag;
    }

    --

    flounder