Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I am getting myself confused with a very simple problem. I have a file containing gene sequences in fasta format (@file).
e.g.
>gb|203-3411
ctacttactgtgactactgtgactgactgtgcatgacg
catcatgcatgtgacttgactgactgatgctgactgct
>gb|4670-5490
ctactgtgactgacttgactgactgtgactgtgctgac
catgcagtactacttactgagtacgtctgtgcgt
I also have an array containing numbers to match some of the gene records (@numbers).
e.g.
456-3210
4670-5490
I want to simply extract all genes whose positions are in the numbers array. So the desired output would be this:
>gb|4670-5490
ctactgtgactgacttgactgactgtgactgtgctgac
catgcagtactacttactgagtacgtctgtgcgt
Please could someone explain how this can be done? Many thanks, AM

Replies are listed 'Best First'.
Re: Iterating through files and arrays simultaneously
by dragonchild (Archbishop) on Oct 06, 2003 at 16:39 UTC
    First off, you want to use a hash to associate a given identifier (e.g. 4670-5490) with a gene sequence. You will want to do something like:
    use strict;
    use warnings;

    # Map each wanted identifier to an (initially empty) list of sequence lines
    my %sequences = map { $_ => [] } (
        '456-3210',
        '4670-5490',
        # etc ...
    );

    my $current;
    while (<FILE_WITH_STUFF>) {
        chomp;

        # If this line is a sequence identifier like ">gb|4670-5490" ...
        if (/^>gb\|(\d+-\d+)$/) {
            # Clear the sequence we're parsing
            $current = undef;

            # Skip this one unless it's a sequence we're looking for
            next unless exists $sequences{$1};

            # Mark it as the one we're reading into
            $current = $1;
            next;
        }
        next unless $current;

        push @{ $sequences{$current} }, $_;
    }

    foreach my $key (sort keys %sequences) {
        print "$key:\n";
        print "\t$_\n" for @{ $sequences{$key} };
    }

    Try that and see if it does what you need. (NOTE: I did NOT test this code. It might not even compile.)

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Iterating through files and arrays simultaneously
by stajich (Chaplain) on Oct 06, 2003 at 18:19 UTC
    I assume those numbers are unique identifiers, NOT ranges? Bio::SeqIO from Bioperl is going to be helpful if you are ever going to have sequences in different file formats.
    use Bio::SeqIO;

    my $in = Bio::SeqIO->new(-file => 'filename', -format => 'fasta');
    my %seqs;
    my @idlist = qw(456-3210 4670-5490);

    while ( my $seq = $in->next_seq ) {
        my $id = $seq->display_id;
        $id =~ s/gb\|//;
        $seqs{$id} = $seq;
    }

    my @seqlist = grep { defined } map { $seqs{$_} } @idlist;
    If there are a LOT of sequences and you will be doing this repeatedly for different ID sets, you can do this more efficiently with Bio::Index::Fasta.
    use Bio::Index::Fasta;

    my $idx = Bio::Index::Fasta->new(-filename   => 'seqs.idx',
                                     -write_flag => 1);
    $idx->id_parser(\&myidparser);
    $idx->make_index($seqfile);

    my @idlist = qw(456-3210 4670-5490);
    my @seqlist;
    for my $id ( @idlist ) {
        if ( my $seq = $idx->get_Seq_by_acc($id) ) {
            push @seqlist, $seq;
        }
    }

    # Define your own ID parser if you want to strip out
    # the gb| part of the id.
    # Alternatively, don't do this and make sure the IDs
    # you input exactly match the IDs of the sequences.
    sub myidparser {
        if    ( $_[0] =~ /^>\s*gb\|(\S+)/ ) { return $1 }
        elsif ( $_[0] =~ /^>\s*(\S+)/ )     { return $1 }
        else                                { return }
    }
Re: Iterating through files and arrays simultaneously
by broquaint (Abbot) on Oct 06, 2003 at 16:50 UTC
    If you're happy to slurp the file
    my @data = do {
        open(my $fh, '<', 'yourfile') or die("ack: $!");
        local $/ = '>gb|';
        scalar <$fh>;    # first read will be useless
        <$fh>;
    };

    my @nums = qw( 456-3210 4670-5490 );

    for my $n (@nums) {
        print map  { '>gb|' . substr($_, 0, -4) }
              grep { /^$n/ } @data;
    }
    So that will create an array consisting of chunks of the file delimited by >gb|, and then print any chunk that matches any of the numbers in @nums.
    HTH

    _________
    broquaint

Re: Iterating through files and arrays simultaneously
by flounder99 (Friar) on Oct 06, 2003 at 17:51 UTC
    You could always join your @numbers file into one big regex. That way you don't have to slurp @file into memory. I have a feeling it may be huge.
    use strict;

    open NUMBERFILE, "<\@numbers" or die $!;
    my $regex = join "|",
        map { chomp; "^\\>gb\\|" . quotemeta($_) . "\$" } <NUMBERFILE>;
    close NUMBERFILE;

    open DATAFILE, "<\@file" or die $!;
    my $flag = 0;
    while (<DATAFILE>) {
        if (substr($_, 0, 1) eq ">") {
            $flag = /$regex/;
        }
        print if $flag;
    }

    --

    flounder