concatenating identical sequences

$new_guy has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I would like to concatenate/join sequences (which happen to be protein sequences) that have the same "id" together on the same block. So the important thing in these protein sequences is their "id" (i.e. identifier). An "id" preceeds each sequence. (The "id's" are not always numerical, but always have two underscores (i.e. "_"), e.g 5390_7_9 ). The sequences are organized in blocks of text each beginning with an"id". For example:

5390_7_9   MEWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE
EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED
LSNKDYLKRV KNGYAEKYAV NEKVYNVPFT ANAYGIYYNK DKFEELGLKV PETWDEFEQL
VKDIVAKGQT PFGIAGADAW TLNGYNQLAF ATATGGGKEA NQYLRYSQPN AIKLSDPIMK
DDIKVMDILR INGSKQKNWE GAGYTDVIGA FARGDVLMTP NGSWAITAIN EQKPNFKIGT
FMIPGKEKRQ SLTVGAGDLA WSISATTKHP KEANAFVEYM TRPEVMQKYY DVDGSPTAIE
GVKQAGEDSP LAGMTEYAFT DRHLVWLQQY WTSEADFHTL TMNYVLTGDK QGMVNDLNAF
FNPMKADVD
[download]

The sequence information might be on a single line or on multiple lines (as is the case above) starting from the line with it's "id" upto the end of the block. After which another "id" and its sequence follows.

So what I would like is to convert files like this (i.e. single "id" for joined blocks of sequences, ideally joined in the order they appear on the original "mixed-up" file):

5390_7_9   MEWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE
EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED
5390_8_1   MKWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE
EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED
5390_8_2   MEWYKKIGLL ATTALALFGL GACSNYGKSA DDTVTIEYFN QKKEMTKILE
EITRDFEKEN SKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED
5390_7_9   MEWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE
EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED
LSNKDYLKRV KNGYAEKYAV NEKVYNVPFT ANAYGIYYNK DKFEELGLKV PETWDEFEQL
5390_8_1   MKWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE
5390_8_2   MEWYKKIGLL ATTALALFGL GACSNYGKSA DDTVTIEYFN QKKEMTKILE
EITRDFEKEN SKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED
LSNKDYLKRV KNGYAEKYAV NEKVYNVPFT ANAYGIYYNK DKFEELGLKV PETWDEFEQL
[download]

To a single file that looks like this (Notice on each line of each block they are divided into some sort of "chunks" with 10 letters in each chunk, and each line has 6 such chunks (except for the very last chunk at the end of each block which might be shorter than 10 letters)):

5390_7_9   MEWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE
EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED
MEWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE EITRDFEKEN
PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED LSNKDYLKRV
KNGYAEKYAV NEKVYNVPFT ANAYGIYYNK DKFEELGLKV PETWDEFEQL

5390_8_1   MKWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE
EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED
MKWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE

5390_8_2   MEWYKKIGLL ATTALALFGL GACSNYGKSA DDTVTIEYFN QKKEMTKILE
EITRDFEKEN SKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED
MEWYKKIGLL ATTALALFGL GACSNYGKSA DDTVTIEYFN QKKEMTKILE EITRDFEKEN
SKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED LSNKDYLKRV
KNGYAEKYAV NEKVYNVPFT ANAYGIYYNK DKFEELGLKV PETWDEFEQL
[download]

I am certain that the end result will result in sequences of equal length. So if I calciulate the length of just a single block I will be able to know the length of the others (well except in this example where 5390_8_1 in the end seems a bit shorter). I have these blocks in the tens of thousands.

I started out with the following small bit of code but got stuck.

#usr/bin/perl -w

use strict;
if (scalar(@ARGV) != 1) {
    print "\n";
    print "Usage: script.pl <file>";
    print "\n";
    exit();
  }

my ($FILENAME) = @ARGV; #read in file
open(INFILE, $FILENAME);

## remove existing files    
my $remove = "new_alignment_".$FILENAME;  #remove any existing results
+ file
    if (unlink($remove) == 1) {
                        print "Existing \"$remove\" file was removed\n
+";
                          }

## generate storage file    
my $outputfile = "new_alignment_".$FILENAME; 
    unless ( open(POS, ">>$outputfile") ) {
    print "Cannot open file \"$outputfile\" to write to!!\n\n";
        exit;
        }

## declare variables
my @array1 = ();
my @no_duplicates = ();
my %seen = ();
our $protein_id;
my $element;
my $key;
my $line;

#read file and do stuff
while ($line = <INFILE>) {
    
    if($line =~/(\d+)_(\d+)_(\d+)/){
        #print POS $line."\n";    # check if the fasta file ID's print
+        
           push(@array1, $line);  # store all the id's in an array    
+   
        }

#remove duplicates in "array1" and store array elements in a new array
+ "no_duplicates"
    
    foreach my $a(@array1){ 
    unless ($seen{$a}){
        push (@no_duplicates, $a);
        chomp @no_duplicates;
        $seen{$a} = 1;
    }
    }
}

#now start poping a single element from the array each time writing ou
+t all sequences with the id
    while ($element = pop @no_duplicates){
        $protein_id = $element; 
             #print $protein_id."\n"; #check if no duplicates are kept
+ - it works
         
#now open the file again and start search for the popped element and s
+ee whether it marches an id
#if it does print the id and the all the blocks (joines together with 
+that id)
         while(my $line2 = <INFILE>){
        if($line2 =~ /$proitein_id/){  #says protein_id need explicit 
+package, why??
        print $line2;
        }
     }
    }
[download]

I would really appreciate help to move this forward!

Many Thanks

$new_guy

Comment on concatenating identical sequences Select or Download Code

Replies are listed 'Best First'.
Re: concatenating identical sequences by Limbic~Region (Chancellor) on Oct 04, 2011 at 15:51 UTC
$new_guy, How big are the files you need to work with. It should be fairly trivial to do this using a hash of arrays if everything fits in memory. Assume for a second you had a function that could fetch the next record (mutli-line or not) as well as the id. It would look something like this: `my %data; while (my $rec = fetch_record($fh)) { my $id = $rec->{id}; push @{$data{$id}}, $rec->{sequence}; } for my $id (keys %data) { print "$id "; print "$_\n" for @{$data{$id}}; }` [download] Alternatively, if you can't afford to fit the entire file in memory, you could still use this technique by storing the file offset and not the actual sequence. This will require more IO with tell and seek but should allow the same simplicity in the code. One last alternative would be to re-write the file merging all the rows for a record on one line. Next, sort the file so duplicate IDs are adjacent and then it should be straight forward to merge them. Since it appears each row is fixed length, recreating the original structure from a single line should be straight forward. Cheers - L~R	[reply] [d/l]
Re: concatenating identical sequences by BrowserUk (Patriarch) on Oct 04, 2011 at 16:26 UTC
I'd use a similar approach to Limbic~Region, but I'd accumulate the blocks as strings rather than arrays as they consume considerably less space, which will put off the time when you need to use a multi-pass solution. Building an array of newly read IDs as you go, allows for retaining the original ordering on output. This: #! perl -slw use strict; my( $id, %accu, @order ); while( <DATA> ) { chomp; if( m[^(\S+_\S+_\S+)\s+(.+)\s*$] ) { $id = $1; unless( exists $accu{ $id } ) { push @order, $id; $accu{ $id } = $2; } else { $accu{ $id } .= ' ' . $2; } } else { $accu{ $id } .= ' ' . $_; } } for my $key ( @order ) { printf "%-10s %s\n", $key, substr( $accu{ $key }, 0, 55, '' ); print substr( $accu{ $key }, 0, 66, '' ) while length $accu{ $key +}; print ''; } __DATA__ 5390_7_9 MEWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED 5390_8_1 MKWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED 5390_8_2 MEWYKKIGLL ATTALALFGL GACSNYGKSA DDTVTIEYFN QKKEMTKILE EITRDFEKEN SKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED 5390_7_9 MEWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED LSNKDYLKRV KNGYAEKYAV NEKVYNVPFT ANAYGIYYNK DKFEELGLKV PETWDEFEQL 5390_8_1 MKWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE 5390_8_2 MEWYKKIGLL ATTALALFGL GACSNYGKSA DDTVTIEYFN QKKEMTKILE EITRDFEKEN SKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED LSNKDYLKRV KNGYAEKYAV NEKVYNVPFT ANAYGIYYNK DKFEELGLKV PETWDEFEQL [download] Produces: C:\test>junk35 5390_7_9 MEWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED MEWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED LSNKDYLKRV KNGYAEKYAV NEKVYNVPFT ANAYGIYYNK DKFEELGLKV PETWDEFEQL 5390_8_1 MKWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE EITRDFEKEN PKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED MKWYKKIGLL ATTGLALVGL GACSNYGKSA DGTVTIEYFN QKKEMTKTLE 5390_8_2 MEWYKKIGLL ATTALALFGL GACSNYGKSA DDTVTIEYFN QKKEMTKILE EITRDFEKEN SKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED MEWYKKIGLL ATTALALFGL GACSNYGKSA DDTVTIEYFN QKKEMTKILE EITRDFEKEN SKIKVKVVNV PNAGEVLKTR VLAGDVPDVV NIYPQSIELQ EWAKAGVFED LSNKDYLKRV KNGYAEKYAV NEKVYNVPFT ANAYGIYYNK DKFEELGLKV PETWDEFEQL [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^2: concatenating identical sequences by $new_guy (Acolyte) on Oct 05, 2011 at 05:49 UTC
Thanks Limbic~Region and BrowserUk for your good advice. I have had a quick go with BrowserUk's script and it seems to work OK. However, the merge is not so clean and results after the merge seem not to be in equal chunks (I have posted this on to my scratchpad). I am speculating that this will continue as > continue to add more block (i.e thousands). I have adapted your script as follows: #! perl -slw use strict; if (scalar(@ARGV) != 1) { print "\n"; print "Usage: script.pl <alignment file>"; print "\n"; exit(); } my ($FILENAME) = @ARGV; #read in file open(DATA, $FILENAME); my( $id, %accu, @order ); ## remove existing files my $remove = "new_alignment_".$FILENAME; #remove any existing results + file if (unlink($remove) == 1) { print "Existing \"$remove\" file was removed\n +"; } ## generate a temporary storage file my $outputfile = "new_alignment_".$FILENAME; #make a big file in which + the final results will be printed unless ( open(POS, ">>$outputfile") ) { print "Cannot open file \"$outputfile\" to write to!!\n\n"; exit; } while( <DATA> ) { chomp; if( m[^(\S+_\S+_\S+)\s+(.+)\s*$] ) { $id = $1; unless( exists $accu{ $id } ) { push @order, $id; $accu{ $id } = $2; } else { $accu{ $id } .= ' ' . $2; } } else { $accu{ $id } .= ' ' . $_; } } for my $key ( @order ) { printf POS "%-10s %s\n", $key, substr( $accu{ $key }, 0, 55, '' ); print POS substr( $accu{ $key }, 0, 66, '' ) while length $accu{ $ +key }; print POS ''; } [download] I must also confess I do not understand the last bit of code: `for my $key ( @order ) { printf POS "%-10s %s\n", $key, substr( $accu{ $key }, 0, 55, '' ); print POS substr( $accu{ $key }, 0, 66, '' ) while length $accu{ $ +key }; print POS ''; }` [download] Could you kindly please explain it to me. PS: Still thinking of how to implement what Limbic~Region has pointed out. $new_guy	[reply] [d/l] [select]
Re^3: concatenating identical sequences by BrowserUk (Patriarch) on Oct 05, 2011 at 10:56 UTC
Or you could try this version first: `#! perl -slw use strict; my( $id, %accu, @order ); while( <DATA> ) { chomp; if( m[^(\S+_\S+_\S+)\s+(.+)\s*$] ) { $id = $1; unless( exists $accu{ $id } ) { push @order, $id; $accu{ $id } = $2; } else { $accu{ $id } .= ' ' . $2; } } else { my $pad = 10 - length() % 11; $accu{ $id } .= ' ' . $_ . ' ' x $pad; } } for my $key ( @order ) { printf "%-10s %s\n", $key, substr( $accu{ $key }, 0, 55, '' ); print substr( $accu{ $key }, 0, 66, '' ) while length $accu{ $key +}; print ''; }` [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^4: concatenating identical sequences by $new_guy (Acolyte) on Oct 10, 2011 at 12:14 UTC
Re^3: concatenating identical sequences by BrowserUk (Patriarch) on Oct 05, 2011 at 10:35 UTC
However, the merge is not so clean and results after the merge seem not to be in equal chunks (I have posted this on to my scratchpad). I missed this bit of your spec: "(except for the very last chunk at the end of each block which might be shorter than 10 letters)". Looks like you'll have to use Limbic~Region's solution of splitting the lines and storing the data as arrays. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]