rkenshin09 has asked for the wisdom of the Perl Monks concerning the following question:

Hey,

So, I'm basically trying to write a script that cuts a DNA sequence at specific points. I'm able to find the correct sequence within the string, but I can't for the life of me figure out how to split the string at the right site. Here's my code so far:

use strict; use Getopt::Std; #Invokes Get options package our ($opt_c,$opt_e); getopts('ce:'); my @seq = <>; chomp(@seq); my $seq = join("", @seq); our %re = ( 'BamHI' => 'GGATCC 1', # G | GATCC 'EcoRI' => 'GAATTC 1', # G | AATTC 'EcoRII' => 'CCWGG 0', # | CCWGG ); my @enz = split(/\b(\s+)\b/,$re{$opt_e}); my $enzseq = $enz[0]; my $cut = $enz[2]; print "The restriction site is: $enzseq\n\n"; while($seq =~ /($enzseq)/ig) { my $site = pos($seq),"\n"; print "$site\n"; }
Right now, it prints the locations of where to split the sequence, but I'm not exactly sure how to split the string at these specific location into fragments. For instance, if the string is 6000 characters, and the search string is found at 500, 1000, 2000, 4000, 5000, I want to store/print: (500, 500, 1000, 2000, 1000, 1000).

Any help would be greatly appreciated, Thanks.

~~~~~~~~~~

the current code prints:
The input sequence length is 48502 The restriction site is: GAATTC 21231 26109 31752 39173 44977

So, basically, I want to cut at each of those sites, so the fragments should have a length of 21231, (26109-21231), etc, where you subtract each shorter one from the longer one, where the last one is (48502-44977)

Replies are listed 'Best First'.
Re: splitting a string
by GrandFather (Saint) on Sep 26, 2008 at 03:09 UTC

    It may be that you are looking for something like:

    use strict; my $seq = '------------GGATCC________________GGATCC=============GGATCC ++++++++++'; my @splits = (0); push @splits, pos $seq while $seq =~ /(?=GGATCC)/ig; push @splits, length $seq; for my $index (0 .. $#splits) { my $width = $splits[$index + 1] - $splits[$index]; print substr ($seq, $splits[$index], $width), "\n"; }

    Prints:

    ------------ GGATCC________________ GGATCC============= GGATCC+++++++++

    However it's a bit hard to tell because you include uninteresting rubbish in your sample that means it is unrunnable, but you don't include any sample data and you don't show the result you would actually like.


    Perl reduces RSI - it saves typing
Re: splitting a string
by kyle (Abbot) on Sep 26, 2008 at 02:53 UTC

    Are you looking for substr?

    Maybe you want to use split again, sort of like:

    my @frangments = split /$enzseq/i, $seq;
Re: splitting a string
by johngg (Canon) on Sep 26, 2008 at 11:05 UTC
    If I have understood correctly, you could do a split using look-around assertions, see perlre, either using a look-ahead to split before your enzyme or a look-behind to split after it.

    use strict; use warnings; open my $seqFH, q{<}, \ <<'EOD' or die qq{open: << HEREDOC: $!\n}; CAGTTTCGATCGAATC ATTATTGGATCCGCAT GCGTCAGGATCCAATC EOD my $sequence = join q{}, map { chomp; $_ } <$seqFH>; close $seqFH or die qq{close: << HEREDOC: $!\n}; print qq{Sequence:\n $sequence\n}; my $enzymeSeq = q{GGATCC}; my $rxSplitBefore = qr{(?=$enzymeSeq)}; my $rxSplitAfter = qr{(?<=$enzymeSeq)}; print qq{Splitting before $enzymeSeq\n}; my @chunksBefore = split m{$rxSplitBefore}, $sequence; print qq{ $_\n} for @chunksBefore; print qq{Splitting after $enzymeSeq\n}; my @chunksAfter = split m{$rxSplitAfter}, $sequence; print qq{ $_\n} for @chunksAfter;

    The output.

    Sequence: CAGTTTCGATCGAATCATTATTGGATCCGCATGCGTCAGGATCCAATC Splitting before GGATCC CAGTTTCGATCGAATCATTATT GGATCCGCATGCGTCA GGATCCAATC Splitting after GGATCC CAGTTTCGATCGAATCATTATTGGATCC GCATGCGTCAGGATCC AATC

    I hope I have guessed correctly and this is of some use.

    Cheers,

    JohnGG

    Update: Corrected die text.

A reply falls below the community's threshold of quality. You may see it by logging in.