$new_guy has asked for the wisdom of the Perl Monks concerning the following question:
Dear Monks,
I would like to concatenate/join sequences (which happen to be protein sequences) that have the same "id" together on the same block. So the important thing in these protein sequences is their "id" (i.e. identifier). An "id" preceeds each sequence. (The "id's" are not always numerical, but always have two underscores (i.e. "_"), e.g 5390_7_9 ). The sequences are organized in blocks of text each beginning with an"id". For example:
5390_7_9 MEWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED LSNKDYLKRV KNGYAEKYAV NEKVYNVPFT ANAYGIYYNK DKFEELGLKV PETWDEFEQL VKDIVAKGQT PFGIAGADAW TLNGYNQLAF ATATGGGKEA NQYLRYSQPN AIKLSDPIMK DDIKVMDILR INGSKQKNWE GAGYTDVIGA FARGDVLMTP NGSWAITAIN EQKPNFKIGT FMIPGKEKRQ SLTVGAGDLA WSISATTKHP KEANAFVEYM TRPEVMQKYY DVDGSPTAIE GVKQAGEDSP LAGMTEYAFT DRHLVWLQQY WTSEADFHTL TMNYVLTGDK QGMVNDLNAF FNPMKADVD
The sequence information might be on a single line or on multiple lines (as is the case above) starting from the line with it's "id" upto the end of the block. After which another "id" and its sequence follows.
So what I would like is to convert files like this (i.e. single "id" for joined blocks of sequences, ideally joined in the order they appear on the original "mixed-up" file):
To a single file that looks like this (Notice on each line of each block they are divided into some sort of "chunks" with 10 letters in each chunk, and each line has 6 such chunks (except for the very last chunk at the end of each block which might be shorter than 10 letters)):5390_7_9 MEWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED 5390_8_1 MKWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED 5390_8_2 MEWYKKIGLL ATTALALFGL GACSNYGKSA DDTVTIEYFN QKKEMTKILE EITRDFEKEN SKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED 5390_7_9 MEWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED LSNKDYLKRV KNGYAEKYAV NEKVYNVPFT ANAYGIYYNK DKFEELGLKV PETWDEFEQL 5390_8_1 MKWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE 5390_8_2 MEWYKKIGLL ATTALALFGL GACSNYGKSA DDTVTIEYFN QKKEMTKILE EITRDFEKEN SKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED LSNKDYLKRV KNGYAEKYAV NEKVYNVPFT ANAYGIYYNK DKFEELGLKV PETWDEFEQL
5390_7_9 MEWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED MEWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED LSNKDYLKRV KNGYAEKYAV NEKVYNVPFT ANAYGIYYNK DKFEELGLKV PETWDEFEQL 5390_8_1 MKWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED MKWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE 5390_8_2 MEWYKKIGLL ATTALALFGL GACSNYGKSA DDTVTIEYFN QKKEMTKILE EITRDFEKEN SKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED MEWYKKIGLL ATTALALFGL GACSNYGKSA DDTVTIEYFN QKKEMTKILE EITRDFEKEN SKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED LSNKDYLKRV KNGYAEKYAV NEKVYNVPFT ANAYGIYYNK DKFEELGLKV PETWDEFEQL
I am certain that the end result will result in sequences of equal length. So if I calciulate the length of just a single block I will be able to know the length of the others (well except in this example where 5390_8_1 in the end seems a bit shorter). I have these blocks in the tens of thousands.
I started out with the following small bit of code but got stuck.
#usr/bin/perl -w use strict; if (scalar(@ARGV) != 1) { print "\n"; print "Usage: script.pl <file>"; print "\n"; exit(); } my ($FILENAME) = @ARGV; #read in file open(INFILE, $FILENAME); ## remove existing files my $remove = "new_alignment_".$FILENAME; #remove any existing results + file if (unlink($remove) == 1) { print "Existing \"$remove\" file was removed\n +"; } ## generate storage file my $outputfile = "new_alignment_".$FILENAME; unless ( open(POS, ">>$outputfile") ) { print "Cannot open file \"$outputfile\" to write to!!\n\n"; exit; } ## declare variables my @array1 = (); my @no_duplicates = (); my %seen = (); our $protein_id; my $element; my $key; my $line; #read file and do stuff while ($line = <INFILE>) { if($line =~/(\d+)_(\d+)_(\d+)/){ #print POS $line."\n"; # check if the fasta file ID's print + push(@array1, $line); # store all the id's in an array + } #remove duplicates in "array1" and store array elements in a new array + "no_duplicates" foreach my $a(@array1){ unless ($seen{$a}){ push (@no_duplicates, $a); chomp @no_duplicates; $seen{$a} = 1; } } } #now start poping a single element from the array each time writing ou +t all sequences with the id while ($element = pop @no_duplicates){ $protein_id = $element; #print $protein_id."\n"; #check if no duplicates are kept + - it works #now open the file again and start search for the popped element and s +ee whether it marches an id #if it does print the id and the all the blocks (joines together with +that id) while(my $line2 = <INFILE>){ if($line2 =~ /$proitein_id/){ #says protein_id need explicit +package, why?? print $line2; } } }
I would really appreciate help to move this forward!
Many Thanks
$new_guy
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: concatenating identical sequences
by Limbic~Region (Chancellor) on Oct 04, 2011 at 15:51 UTC | |
|
Re: concatenating identical sequences
by BrowserUk (Patriarch) on Oct 04, 2011 at 16:26 UTC | |
by $new_guy (Acolyte) on Oct 05, 2011 at 05:49 UTC | |
by BrowserUk (Patriarch) on Oct 05, 2011 at 10:56 UTC | |
by $new_guy (Acolyte) on Oct 10, 2011 at 12:14 UTC | |
by BrowserUk (Patriarch) on Oct 05, 2011 at 10:35 UTC |