nicholaspr has asked for the wisdom of the Perl Monks concerning the following question:

seq1='--TAGAGATTGCCCGTAGGACGGGAAGGTGTCAACGTTTTACATTTTGAAC-'
seq2='-ATTGAGATTGCCCGTAGGACGGGAAGGTGTCAACGTTTTACATTTTGAAC-'

First question: I want to find the positions in seq1 where there is a '-' character and delete those positions in seq2. Is it possible to do it without a for loop?

Second question: Ignore any positions in seq1 and seq2 that have the character '-' and find the positions where the two strings differ. For example, in position 4 there is an 'A' in seq1 and a 'T' in seq2. Again, if possible, without having to do a for loop.

Thank you,
Nicholas

Replies are listed 'Best First'.
Re: question 'string'
by toolic (Bishop) on Mar 02, 2010 at 15:43 UTC
    I want to find the positions in seq1 where there is a '-' character
    index
    and delete those positions in seq2.
    substr
    Is it possible to do it without a for loop??
    Possibly, but why do you have this constraint?

    Read the docs that I pointed to, try it with your own code, then, if you still have problems, post the code you have tried, along with actual and expected output.
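As a rough illustration of the index/substr route suggested above, here is a minimal sketch. The helper names gap_positions and delete_positions are hypothetical (they do not come from the thread), the short test strings are borrowed from a later reply, and "delete" is assumed to mean cutting the characters out. Note that this route still loops, so it does not meet the "no for loop" wish.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based offsets of every '-' in a string,
# found by calling index() repeatedly from just past the previous hit.
sub gap_positions {
    my ($seq) = @_;
    my @positions;
    my $from = 0;
    while ((my $at = index($seq, '-', $from)) >= 0) {
        push @positions, $at;
        $from = $at + 1;
    }
    return @positions;
}

# Hypothetical helper: cut the given offsets out of a string with 4-arg substr().
# Offsets are handled from the right so earlier removals do not shift later ones.
sub delete_positions {
    my ($seq, @positions) = @_;
    for my $at (sort { $b <=> $a } @positions) {
        substr($seq, $at, 1, '') if $at < length $seq;
    }
    return $seq;
}

my $seq1 = '--TAGAG--T';
my $seq2 = '-ATTGAGATT';
my @gaps = gap_positions($seq1);
print "gaps at offsets: @gaps\n";
print "seq2 after cut:  ", delete_positions($seq2, @gaps), "\n";
```

Offsets here are 0-based, as index and substr use them; add 1 if 1-based positions are wanted.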

    Also, please do not post general Perl questions in the 'PerlMonks Discussion' section. Read Where should I post X?.

Re: question 'string'
by almut (Canon) on Mar 02, 2010 at 16:11 UTC
    ...and delete those positions in seq2

    What exactly do you mean by "delete"? Cut out the chars at those positions, or somehow tag them as "invalid" (e.g. by also putting "-" in those places), or even something else?  I.e., is the result supposed to be

    1-in:  --TAGAGATTGCCCGTAGGACGGGAAGGTGTCAACGTTTTACATTTTGAAC-
    2-in:  -ATTGAGATTGCCCGTAGGACGGGAAGGTGTCAACGTTTTACATTTTGAAC-
           ====================================================
    2-out: TTGAGATTGCCCGTAGGACGGGAAGGTGTCAACGTTTTACATTTTGAAC

    or

    1-in:  --TAGAGATTGCCCGTAGGACGGGAAGGTGTCAACGTTTTACATTTTGAAC-
    2-in:  -ATTGAGATTGCCCGTAGGACGGGAAGGTGTCAACGTTTTACATTTTGAAC-
           ====================================================
    2-out: --TTGAGATTGCCCGTAGGACGGGAAGGTGTCAACGTTTTACATTTTGAAC-

    I'm not much of a biologist, but keeping the sequences aligned (the latter variant) somehow seems to make more sense... (?)
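The two readings of "delete" can be put side by side in a small sketch. The function names cut_gaps and mark_gaps are hypothetical, and the code assumes both strings have the same length and contain only the characters -, A, C, G and T. It relies on Perl's bitwise string AND, so it should not be run under `use feature 'bitwise'` (or `use v5.28` and later), where `&` treats its operands as numbers.

```perl
use strict;
use warnings;

# Build a byte mask from the reference sequence: "\x00" where it has '-',
# "\xff" everywhere else (assumes the alphabet is only -, A, C, G, T).
sub gap_mask {
    my ($ref) = @_;
    (my $mask = $ref) =~ tr{-ACGT}{\x00\xff};
    return $mask;
}

# Variant 1 (hypothetical name): cut the gap columns out of $seq entirely.
sub cut_gaps {
    my ($ref, $seq) = @_;
    (my $out = $seq & gap_mask($ref)) =~ tr/\x00//d;
    return $out;
}

# Variant 2 (hypothetical name): keep the alignment and mark gap columns with '-'.
sub mark_gaps {
    my ($ref, $seq) = @_;
    (my $out = $seq & gap_mask($ref)) =~ tr/\x00/-/;
    return $out;
}

my $seq1 = '--TAGAG--T';
my $seq2 = '-ATTGAGATT';
print "cut out: '", cut_gaps($seq1, $seq2),  "'\n";
print "aligned: '", mark_gaps($seq1, $seq2), "'\n";
```

Only the second variant keeps the column numbering of the two sequences comparable afterwards, which matters if positions are to be reported later.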

Re: question 'string'
by AnomalousMonk (Archbishop) on Mar 02, 2010 at 17:49 UTC

    As almut wrote, a lot depends on just what you mean by 'delete'; the Devil is in the details.

    In general, though, this sort of thing is handled 'without a for loop' by bitwise boolean operations on strings (see examples below). BrowserUk is very good on this topic (as on so much else), and I seem to remember him or her addressing a similar question in the last month or two, but I can't put my finger on the node at the moment; look back through BrowserUk's posts and you should find much of interest on this subject.

    Examples (these are by no means intended to represent the most efficient approaches to these problems!):

    >perl -wMstrict -le
    "my $seq1 = '--TAGAG--T';
     my $seq2 = '-ATTGAGATT';
     print 'question 1';
     (my $mask1 = $seq1) =~ tr{-ATGC}{\x00\xff};
     print qq{seq1 '$seq1'};
     print qq{seq2 '$seq2'};
     $seq2 &= $mask1;
     $seq2 =~ tr{\x00}{-};
     print qq{seq2 '$seq2'};
     print 'question 2';
     $seq1 = 'G-TATAG';
     $seq2 = 'GATCT-G';
     print qq{seq1 '$seq1'};
     print qq{seq2 '$seq2'};
     (   $mask1 = $seq1) =~ tr{-ATGC}{\x00\xff};
     (my $mask2 = $seq2) =~ tr{-ATGC}{\x00\xff};
     my $dmask = $mask1 & $mask2;
     $seq1 &= $dmask;
     $seq2 &= $dmask;
     my $diff = $seq1 ^ $seq2;
     $diff =~ tr{\x00-\xff}{=D};
     print qq{diff '$diff'};
    "
    question 1
    seq1 '--TAGAG--T'
    seq2 '-ATTGAGATT'
    seq2 '--TTGAG--T'
    question 2
    seq1 'G-TATAG'
    seq2 'GATCT-G'
    diff '===D==='

    Update: Slightly improved example for Question 1.
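The example above ends with a diff string rather than a list of positions, which is what the second question asked for. One way to get from one to the other is sketched here; diff_positions is a hypothetical name, and the same assumptions apply as in the one-liner (equal-length strings, only -, A, T, G and C, and string rather than numeric bitwise operators).

```perl
use strict;
use warnings;

# Hypothetical helper: 1-based positions where two aligned sequences differ,
# ignoring every column in which either sequence has a '-'.
sub diff_positions {
    my ($seq1, $seq2) = @_;

    # "\x00" where a sequence has '-', "\xff" elsewhere.
    (my $mask1 = $seq1) =~ tr{-ATGC}{\x00\xff};
    (my $mask2 = $seq2) =~ tr{-ATGC}{\x00\xff};
    my $dmask = $mask1 & $mask2;

    # After masking, XOR yields "\x00" for equal columns and non-zero otherwise.
    my $diff = ($seq1 & $dmask) ^ ($seq2 & $dmask);

    # pos() after matching one character is that character's 1-based position.
    my @positions;
    push @positions, pos($diff) while $diff =~ /[^\x00]/g;
    return @positions;
}

my @where = diff_positions('G-TATAG', 'GATCT-G');
print "sequences differ at position(s): @where\n";
```

The regex match still iterates internally, but there is no explicit for loop over the characters of either string.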

Re: question 'string'
by Anonymous Monk on Mar 02, 2010 at 15:45 UTC
    $ perl
    seq1='--TAGAGATTGCCCGTAGGACGGGAAGGTGTCAACGTTTTACATTTTGAAC-'
    ^Z
    Can't modify constant item in scalar assignment at - line 1, at EOF
    Execution of - aborted due to compilation errors.
    I suggest you read perlintro, then RFC: Bioinformatics Tutorial
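For comparison with the error above, this is a minimal sketch of how the two sequences from the question could be declared as valid Perl: scalar variables take a `$` sigil and, under strict, are declared with `my`. The length check is an added assumption about what a sensible first sanity test would be, not something from the thread.

```perl
use strict;
use warnings;

# Scalars need a '$' sigil; a bare "seq1=..." is what triggers the error above.
my $seq1 = '--TAGAGATTGCCCGTAGGACGGGAAGGTGTCAACGTTTTACATTTTGAAC-';
my $seq2 = '-ATTGAGATTGCCCGTAGGACGGGAAGGTGTCAACGTTTTACATTTTGAAC-';

# Position-by-position comparison only makes sense for equal-length strings.
if (length($seq1) == length($seq2)) {
    print "both sequences have the same length\n";
}
else {
    print "sequences differ in length\n";
}
```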