Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need a contig-generating script that builds a contiguous sequence from a list of subsequences.
The input is a list of subsequences (strings) that were broken down from an original long sequence. The input can also contain sequences that do not belong to the final sequence, and more than one contiguous sequence may be generated from the input data.
Each subsequence is between 200 and 500 characters long, and there can be up to 5000 input subsequences. The generated contig can be more than 1000 characters long.
Let us say we have two sequences:

my $str1 = "AATAGCAATTGACAAT";
my $str2 = "CAATCGGAACCAGCAT";

The task is to match the right edge of $str1 (say, its last 4 characters) against the left edge of $str2 (say, its first 4 characters). The number of matched characters can vary, but the match must be exact and as long as possible; the maximum is the length of the shorter of the two strings. The minimum number of characters to match can be set on the command line. The matched string in the above example is CAAT.
Joining the two sequences on this overlap gives the contig AATAGCAATTGACAATCGGAACCAGCAT.
Similarly, the final contig needs to be generated by taking up more sequences from the input list.
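To make the pairwise step concrete, here is a rough sketch of what I mean. The names max_overlap and merge_pair are made up for illustration, and the minimum overlap is hard-coded here where the real script would read it from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: length of the longest suffix of $left that is
# also a prefix of $right, or 0 if no overlap reaches $min characters.
sub max_overlap {
    my ($left, $right, $min) = @_;
    # The overlap can be at most as long as the shorter string.
    my $max = length($left) < length($right) ? length($left) : length($right);
    for (my $len = $max; $len >= $min && $len > 0; $len--) {
        return $len if substr($left, -$len) eq substr($right, 0, $len);
    }
    return 0;
}

# Hypothetical helper: join two sequences on their overlap,
# or return undef when the overlap is shorter than $min.
sub merge_pair {
    my ($left, $right, $min) = @_;
    my $len = max_overlap($left, $right, $min);
    return unless $len;
    return $left . substr($right, $len);
}

my $min = 4;    # assumption: would come from the command line, e.g. $ARGV[0]
my $contig = merge_pair("AATAGCAATTGACAAT", "CAATCGGAACCAGCAT", $min);
print defined $contig ? "$contig\n" : "no overlap\n";
```

With a minimum of 4 this joins the two example strings on CAAT; with a minimum of 5 it reports no overlap.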
Thanks for reading this.
braj
  • Comment on How do I Gererate Contigs out of a list of sequences?

Replies are listed 'Best First'.
Re: How do I Gererate Contigs out of a list of sequences?
by busunsl (Vicar) on Apr 03, 2001 at 12:56 UTC
    This should work:
    my $str1 = "AATAGCAATTGACAAT";
    my $str2 = "CAATCGGAACCAGCAT";
    # Join with '#', then drop the separator and the repeated overlap (\2).
    (my $concat = "$str1#$str2") =~ s/(.*?)(.*)#\2(.*)/$1$2$3/;
    print $concat, "\n";
    Based on the idea from IO
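That only joins one pair. For a whole list, one possible approach (a greedy sketch, not a proper assembler) is to keep merging whichever ordered pair has the longest overlap until no pair reaches the minimum. The names overlap_len and assemble are made up here. It assumes '#' never appears in the sequences, ignores reads fully contained in another read, and compares every pair on every pass, so it will be slow on 5000 reads.

```perl
use strict;
use warnings;

# Hypothetical helper: longest suffix of $left that is a prefix of
# $right, found with the same backreference trick; 0 if below $min.
sub overlap_len {
    my ($left, $right, $min) = @_;
    # Assumes '#' never occurs in the sequences. The leftmost match is
    # the longest overlap; at worst the capture is the empty string.
    my ($ov) = "$left#$right" =~ /(.*)#\1/s;
    my $len = length $ov;
    return $len >= $min ? $len : 0;
}

# Hypothetical helper: greedily merge the best-overlapping pair until
# no pair overlaps by at least $min. Returns the remaining sequences.
sub assemble {
    my ($min, @seqs) = @_;
    while (1) {
        my ($best, $bi, $bj) = (0, -1, -1);
        for my $i (0 .. $#seqs) {
            for my $j (0 .. $#seqs) {
                next if $i == $j;
                my $len = overlap_len($seqs[$i], $seqs[$j], $min);
                ($best, $bi, $bj) = ($len, $i, $j) if $len > $best;
            }
        }
        last unless $best;
        # Extend the left sequence with the non-overlapping tail of the
        # right one, then drop the right one from the list.
        $seqs[$bi] .= substr($seqs[$bj], $best);
        splice @seqs, $bj, 1;
    }
    return @seqs;
}

my @reads = ("CAATCGGAACCAGCAT", "GGGGTTTTCCCC", "AATAGCAATTGACAAT", "AGCATTTGA");
print "$_\n" for assemble(4, @reads);
```

With these four made-up reads and a minimum overlap of 4, three of them chain into one contig and the unrelated read is left as it was, which matches the requirement that the input may contain sequences that do not belong.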
Re: How do I Gererate Contigs out of a list of sequences?
by scain (Curate) on Apr 03, 2001 at 19:43 UTC
    I imagine you know this already, but if you are dealing with sequence data, this may not be the best way to do contigging. You have to deal with sequencing errors, vector removal, and repeat sequences. I would suggest you look into two things:
    • phrap, a contigging tool (I think it is free for academics), and
    • Bioperl, which has several objects and methods for dealing with sequence data
    Good luck,

    Scott

Re: How do I Gererate Contigs out of a list of sequences?
by $code or die (Deacon) on Apr 03, 2001 at 19:10 UTC
    ++ Are you using Perl to unravel the human genome?

    $ perldoc perldoc