Re: Fasta Using Perl

Replies are listed 'Best First'.

Re^2: Fasta Using Perl
by talexb (Chancellor) on Jan 23, 2005 at 16:18 UTC

It looks like you're trying to catch the individual pairs from this part of the output:

        40        50        60        70        80        90       
HAHU   TTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVN
                                     ... ..... .  :  ::: :.. ..: :.
CG1674                              MDSTLNIENVNDPTSIASDLSAENTKADLVS
                                            10        20        30

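If that's the case, then you need a Perl script to extract those sequences from the output. It looks like HAHU is the sample and the other sequences are from the library, so maybe you want to capture the HAHU bits -- I'm not that clear.

Anyway, I've hacked up a bit of Perl that should help you get started. It's all I have time for now: I have to attend to a sick Cygwin installation, make waffles for the family, and attend a funeral (really), so ..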

#!/usr/bin/perl -w

use strict;

while(<DATA>) {
  print "---------------\n";
  if (/^(\s+\d{2,3})+/) { #  Start of block
    print "Analyze:\n$_";

    #  Here I'm just grabbing individual lines from the
    #  fasta output into variables. There's the sample
    #  scale, the sample, the match (dots and colons),
    #  the library and the library scale.

    my $samScale = $_;
    my $sample = <DATA>;
    my $match = <DATA>;
    my $library = <DATA>;
    my $libScale = <DATA>;

    #  I'm using a regular expression to figure out how
    #  long the leading blanks are ($startBlanks) and where
    #  the match markers end ($endBlanks holds the leading
    #  blanks plus the markers, without the trailing blanks).

    my ( $endBlanks, $startBlanks ) =
      $match =~ /^((\s+).+?)\s+$/;
    print "Start at " . length($startBlanks);
    print ", end at " . length($endBlanks) . "\n";

    #  Since the regular expression grabbed the relevant
    #  pieces of the string but we just want the lengths,
    #  we do that conversion here.

    my ( $start, $end ) =
      ( length($startBlanks), length($endBlanks) );

    #  Done .. print out the matching parts.

    print "Sample match is: " .
      substr($sample,$start, $end-$start) . "\n";
    print "Library match is: " .
      substr($library,$start, $end-$start) . "\n";
  } else {

    #  Skip the parts that appear to be commentary.
    #  Debug code, thus commented out but left behind.

    # print "Skip:\n$_";
  }
}

__DATA__
        40        50        60        70        80        90       
HAHU   TTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVN
                                     ... ..... .  :  ::: :.. ..: :.
CG1674                              MDSTLNIENVNDPTSIASDLSAENTKADLVS
                                            10        20        30 

       100       110       120       130       140                 
HAHU   FKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR                
       ..  .    .. :. : :: : : : ::.:                              
CG1674 LNEPNVNDQTSSASDLTAENTKADHDSLNKPKDFNNQILNIISDIDINIKAQEKITQLKE
              40        50        60        70        80        90 

>>CG11153-PA type=protein; loc=4:complement(821536..8223  (580 aa)
 initn:  43 init1:  43 opt:  69  Z-score: 84.3  bits: 23.5 E():  1.3
Smith-Waterman score: 69;  45.455% identity (48.387% ungapped) in 33 a
+a overlap (57-89:513-543)

         30        40        50        60        70        80      
HAHU   EALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDL
                                     : ...:: : . :: :..::  : :: :  
CG1115 AEMRQLWCRTGGVSGGSGSLCADACPKGSGGSNSQVAVAAAAAVYHLQDM--ASSAASTA
            490       500       510       520       530         540

When I run this I get the following matches:

---------------
Analyze:
        40        50        60        70        80        90       
Start at 37, end at 67
Sample match is: NAVAHVDDMPNALSALSDLHAHKLRVDPVN
Library match is: DSTLNIENVNDPTSIASDLSAENTKADLVS
---------------
---------------
Analyze:
       100       110       120       130       140                 
Start at 7, end at 37
Sample match is: FKLLSHCLLVTLAAHLPAEFTPAVHASLDK
Library match is: LNEPNVNDQTSSASDLTAENTKADHDSLNK
---------------
---------------
---------------
---------------
---------------
---------------
---------------
Analyze:
         30        40        50        60        70        80      
Start at 37, end at 65
Sample match is: GHGKKVADALTNAVAHVDDMPNALSALS
Library match is: GSNSQVAVAAAAAVYHLQDM--ASSAAS
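
The same start and length can also be read straight from Perl's `@-` and `@+` arrays, which hold the start and end offsets of each capture group, so there is no need to capture the blanks and measure them. This is only a sketch: `match_span` and `aligned_pair` are made-up names, not part of the script above, and the test data is the first block rebuilt with explicit padding.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the script above: return the offset
# and length of the marker run (dots and colons) in a match line.
sub match_span {
    my ($match) = @_;

    # Group 1 runs from the first to the last non-blank character.
    return unless $match =~ /(\S(?:.*\S)?)/;
    return ( $-[1], $+[1] - $-[1] );
}

# Hypothetical helper: cut the same columns out of the sample and the
# library line. Returns an empty list if the match line has no markers.
sub aligned_pair {
    my ( $sample, $match, $library ) = @_;
    my ( $start, $len ) = match_span($match) or return;
    my @out;
    for my $line ( $sample, $library ) {
        # Guard against lines shorter than the marker offset.
        push @out, length($line) > $start ? substr( $line, $start, $len ) : '';
    }
    return @out;
}

# First block from the post, rebuilt with explicit padding.
my $sample = 'HAHU' . ( ' ' x 3 )
  . 'TTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVN';
my $match   = ( ' ' x 37 ) . '... ..... .  :  ::: :.. ..: :.' . "\n";
my $library = 'CG1674' . ( ' ' x 30 ) . 'MDSTLNIENVNDPTSIASDLSAENTKADLVS';

my ( $s, $l ) = aligned_pair( $sample, $match, $library );
print "Sample match is: $s\n";
print "Library match is: $l\n";
```

Keeping the offset arithmetic in one small sub makes it easier to change the rule later, for example if the markers turn out to need different handling.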
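
Anyway, this is all a wild guess based on the output you've provided. There's obviously more to do: you'll want to match up the first and second pieces, since I can see those two are part of the same string. But I don't pretend to know anything about biochemistry, so I leave that up to you.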
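
The first two results in the output both pair HAHU with CG1674, so they look like two pieces of one alignment. A rough sketch of joining such pieces follows. It assumes the labels are taken from the first word of the sample and library lines, and that pieces sharing both labels belong together in the order they appear; that is a guess about the data, not something the FASTA output guarantees. `stitch_pieces` and the hash keys are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: join pieces that share the same sample and
# library labels. Each piece is a hash ref with the keys
# sample_id, library_id, sample and library.
sub stitch_pieces {
    my @pieces = @_;
    my ( @order, %joined );
    for my $p (@pieces) {
        my $key = "$p->{sample_id}\t$p->{library_id}";
        if ( exists $joined{$key} ) {
            # Same pair seen before: append to the earlier piece.
            $joined{$key}{sample}  .= $p->{sample};
            $joined{$key}{library} .= $p->{library};
        }
        else {
            # First piece for this pair: remember its position.
            push @order, $key;
            $joined{$key} = {%$p};
        }
    }
    return map { $joined{$_} } @order;
}

# The three results printed by the script above.
my @stitched = stitch_pieces(
    {   sample_id  => 'HAHU',
        library_id => 'CG1674',
        sample     => 'NAVAHVDDMPNALSALSDLHAHKLRVDPVN',
        library    => 'DSTLNIENVNDPTSIASDLSAENTKADLVS',
    },
    {   sample_id  => 'HAHU',
        library_id => 'CG1674',
        sample     => 'FKLLSHCLLVTLAAHLPAEFTPAVHASLDK',
        library    => 'LNEPNVNDQTSSASDLTAENTKADHDSLNK',
    },
    {   sample_id  => 'HAHU',
        library_id => 'CG1115',
        sample     => 'GHGKKVADALTNAVAHVDDMPNALSALS',
        library    => 'GSNSQVAVAAAAAVYHLQDM--ASSAAS',
    },
);

for my $r (@stitched) {
    print "$r->{sample_id} vs $r->{library_id}\n";
    print "  $r->{sample}\n  $r->{library}\n";
}
```

Note that the library label in the alignment lines is cut to six characters (CG1115 for CG11153-PA), so two different library entries could in principle share a key.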

To learn Perl, I highly recommend you get a copy of Learning Perl and then Programming Perl, both excellent books from O'Reilly, available either from your local computer bookstore or over the web.

Perl may be a little difficult to learn, but it's an amazingly powerful tool once you get familiar with it. Good luck!

Alex / talexb / Toronto

"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
