utpalmtbi has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have a sequence file (input_file.txt) with a huge number of contigs, such as the following (the contig numbers do not reflect the order):

>contig number 11

tttgctcggaggggatc

>contig number 23

gaaaacacttccttattatacaggtaaaccgtatttggat

>contig number 3

aaagctcggaggggatcccct

... ..

I want to concatenate the contigs so that the above order is preserved, and I also want to insert the sequence "nnnnncattccattcattaattaattaatgaatgaatgnnnnn" at each contig boundary (there are two contig boundaries here), so that the final output file looks as follows:

>concatenated contig

tttgctcggaggggatcnnnnncattccattcattaattaattaatgaatgaatgnnnnngaaaacacttccttattatacaggtaaaccgtatttggatnnnnncattccattcattaattaattaatgaatgaatgnnnnnaaagctcggaggggatcccct

For the concatenation itself, I use

perl -pe "chomp;s/>.+//" input_file.txt >output_file.txt

But I don't know how to insert the sequence at each contig boundary. Please help.

Replies are listed 'Best First'.
Re: merge sequences with new sequence insertion
by hdb (Monsignor) on Nov 26, 2013 at 20:33 UTC

    Read all lines into an array, grep the contigs, and then join them:

    use strict;
    use warnings;

    my @data = grep { !/^(>|\s)/ } <DATA>;
    chomp @data;
    print join "nnnnncattccattcattaattaattaatgaatgaatgnnnnn", @data;

    __DATA__
    >contig number 11
    tttgctcggaggggatc
    >contig number 23
    gaaaacacttccttattatacaggtaaaccgtatttggat
    >contig number 3
    aaagctcggaggggatcccct

    or a one liner using regexes:

    perl -e "undef $/; print join 'nnnnncattccattcattaattaattaatgaatgaatgnnnnn', <> =~ /^([tcag]+)$/gm;" input_file.txt >output_file.txt
Re: merge sequences with new sequence insertion
by boftx (Deacon) on Nov 26, 2013 at 20:22 UTC

    Not sure if it can be done as a one-liner, but I would slurp the input file into an array, then use join to do the concatenation (with the boundary sequence as the join expression.)

    It helps to remember that the primary goal is to drain the swamp even when you are hip-deep in alligators.
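
The slurp-and-join approach described above can be sketched as a short script. This is a minimal sketch, not boftx's own code: the sub name `concatenate_contigs` is hypothetical, the `>concatenated contig` header line is taken from the output the OP asked for, and it assumes each contig occupies a single line, as in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): takes the joiner sequence and a
# list of raw input lines, and returns the contigs joined by the joiner.
sub concatenate_contigs {
    my ($joiner, @lines) = @_;
    chomp @lines;
    # Keep only sequence lines: drop ">" headers and blank lines.
    my @contigs = grep { !/^>/ && /\S/ } @lines;
    return join $joiner, @contigs;
}

my $joiner = 'nnnnncattccattcattaattaattaatgaatgaatgnnnnn';

# Sample records from the question, read through an in-memory filehandle so
# the sketch runs without input_file.txt; a real run would open that file.
my $sample = ">contig number 11\ntttgctcggaggggatc\n"
           . ">contig number 23\ngaaaacacttccttattatacaggtaaaccgtatttggat\n"
           . ">contig number 3\naaagctcggaggggatcccct\n";
open my $in, '<', \$sample or die "Cannot open in-memory sample: $!";
my @lines = <$in>;
close $in;

print ">concatenated contig\n", concatenate_contigs($joiner, @lines), "\n";
```

Because `join` only places the joiner between list elements, no joiner is written before the first contig or after the last one.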
Re: merge sequences with new sequence insertion
by rnaeye (Friar) on Nov 27, 2013 at 02:54 UTC

    How about something like this:

    #!/usr/bin/perl
    use warnings;
    use 5.12.4;

    my $string = "NNNNNCATTCCATTCATTAATTAATTAATGAATGAATGNNNNN";

    while (<DATA>) {
        next if /^>.+/;
        next if /^$/;
        chomp;
        $_ .= $string unless eof;
        print;
    }

    __DATA__
    >contig number 11
    tttgctcggaggggatc
    >contig number 23
    gaaaacacttccttattatacaggtaaaccgtatttggat
    >contig number 3
    aaagctcggaggggatcccct

    It prints:

    tttgctcggaggggatcNNNNNCATTCCATTCATTAATTAATTAATGAATGAATGNNNNNgaaaacacttccttattatacaggtaaaccgtatttggatNNNNNCATTCCATTCATTAATTAATTAATGAATGAATGNNNNNaaagctcggaggggatcccct
Re: merge sequences with new sequence insertion
by 2teez (Vicar) on Nov 26, 2013 at 21:28 UTC

    Hi,
    You can read one line at a time (since I don't know how large the OP's file is), chomp the line, check whether it meets your condition using a simple regex, and print it out with the string to insert at the end, like so:

    use warnings;
    use strict;

    my $str = 'nnnnncattccattcattaattaattaatgaatgaatgnnnnn';

    while (<DATA>) {
        chomp;
        print $_, $str if !/^>|^$/;
    }

    __DATA__
    >contig number 11
    tttgctcggaggggatc
    >contig number 23
    gaaaacacttccttattatacaggtaaaccgtatttggat
    >contig number 3
    aaagctcggaggggatcccct
    This works for me.

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me
      ... print it out with the string to insert at the end ...

      Doesn't that leave you with a 'nnnnnwhatevernnnnn' "joiner" sequence pasted at the end of the last line of the file? My interpretation of the OP was that joiners should be pasted only between the "contig" sequences. (But I have done no testing of your code.)

        Hi AnomalousMonk,
        Awoshhh!! My bad! You are right. I think I misread the OP's requirement. I think rnaeye has a better implementation of that.

        If you tell me, I'll forget.
        If you show me, I'll remember.
        if you involve me, I'll understand.
        --- Author unknown to me
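
The difference discussed in this sub-thread is easy to check on a small list. The sketch below uses short stand-in sequences and a stand-in joiner (both assumptions, not the OP's data) to compare appending the joiner after every contig with using `join`.

```perl
use strict;
use warnings;

my @contigs = qw(aaa ccc ggg);   # stand-in sequences, not the OP's data
my $joiner  = 'NNN';             # short stand-in for the real joiner

# Appending the joiner after every contig leaves one at the very end.
my $appended = join '', map { $_ . $joiner } @contigs;

# join places the joiner only between neighbouring contigs.
my $joined = join $joiner, @contigs;

print "$appended\n";   # aaaNNNcccNNNgggNNN
print "$joined\n";     # aaaNNNcccNNNggg
```

With three contigs there are two boundaries, so only the second form matches the output the OP described.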
Re: merge sequences with new sequence insertion
by oiskuu (Hermit) on Nov 26, 2013 at 22:32 UTC
    As a one-liner:
    perl -ne 'chomp; print "GLUE" x !!$n++, $_ unless /^>/'
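
The `"GLUE" x !!$n++` expression may be worth unpacking: `$n++` returns the old value of the counter, `!!` turns it into a boolean (false for the first record, true afterwards), and `x` repeats the string that many times, so the glue is printed before every sequence except the first. Below is an expanded sketch of the same idea. The sub name `glue_stream` is hypothetical, and the blank-line check is an added assumption that the one-liner itself does not make.

```perl
use strict;
use warnings;

# Hypothetical expansion of the one-liner: returns the glued string
# instead of printing, and also skips blank lines (an added assumption).
sub glue_stream {
    my ($glue, @lines) = @_;
    my $seen = 0;       # number of sequence lines emitted so far
    my $out  = '';
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^>/ or $line !~ /\S/;
        # $seen++ yields the old count; !! makes it false (first) or 1 (later),
        # so the glue is repeated zero times before the first contig only.
        $out .= $glue x !!$seen++;
        $out .= $line;
    }
    return $out;
}

print glue_stream('GLUE', ">c1\n", "aaa\n", ">c2\n", "ccc\n"), "\n";   # aaaGLUEccc
```

Since the glue goes in front of each later contig rather than behind each one, a trailing blank line in the input cannot produce a trailing joiner.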
Re: merge sequences with new sequence insertion
by Kenosis (Priest) on Nov 26, 2013 at 20:41 UTC

    Perhaps the following will be helpful:

    perl -ne 'chomp; $x=0; print if /^(?<!>)\S+$/' input_file.txt >output_file.txt

    Output on your dataset:

    tttgctcggaggggatcgaaaacacttccttattatacaggtaaaccgtatttggataaagctcggaggggatcccct

    Edit: Sorry... Didn't notice the "insert" spec.