in reply to Re: Parsing BLAST
in thread Parsing BLAST

I am trying to find which 20-mers are unique to my sequence. I've read the material at Pasteur, and it doesn't really seem to help with my particular problem. I also have to do this search using FASTA and have no clue where to start with that, but that's another bird to kill. Thanks!

Re^3: Parsing BLAST
by Anonymous Monk on Apr 25, 2006 at 00:28 UTC
    I always parse blast in its -m 8 or -m 9 tabular output format. Much easier to parse.
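A minimal sketch of what parsing the tabular format might look like. It assumes the standard 12 tab-separated columns of `-m 8` output; the hash key names, the `parse_hit` helper and the sample lines are made up for illustration, not taken from real BLAST results.

```perl
use strict;
use warnings;

# Names for the 12 tab-separated columns of -m 8 output (the names are my own).
my @FIELDS = qw(query subject pct_identity aln_length mismatches gap_opens
                q_start q_end s_start s_end evalue bit_score);

# Turn one tabular line into a hash ref; returns nothing for
# comment lines, blank lines, or lines with the wrong column count.
sub parse_hit {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^#/ or $line !~ /\S/;   # -m 9 adds '#' comment lines
    my @cols = split /\t/, $line;
    return unless @cols == @FIELDS;
    my %hit;
    @hit{@FIELDS} = @cols;
    return \%hit;
}

# Invented example lines, for illustration only.
my @lines = (
    "# Fields: Query id, Subject id, ...",
    "mer_1\tchr1\t100.00\t20\t0\t0\t1\t20\t501\t520\t2e-04\t40.1",
    "mer_2\tchr2\t95.00\t20\t1\t0\t1\t20\t77\t96\t0.05\t32.2",
);

for my $line (@lines) {
    my $hit = parse_hit($line) or next;
    print join("\t", @$hit{qw(query subject pct_identity aln_length)}), "\n";
}
```

In real use the lines would come from a BLAST output file read with `while (my $line = <$fh>)` rather than from an array.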
Re^3: Parsing BLAST
by srdst13 (Pilgrim) on Apr 25, 2006 at 01:58 UTC
    Unless your sequence is quite large (and so you have many thousands of unique 20mers), I would go the hash route. It will be VERY fast if memory isn't limiting. If that isn't feasible, break your sequence into fasta sequences of size 20 base pairs and give each a unique ID. Then, blast away using tabular output. Then, you can parse to your heart's content using simple perl.

    Sean
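Both suggestions can be sketched roughly as follows. The function names `count_kmers` and `kmers_to_fasta`, the `mer` ID prefix and the toy sequence are assumptions for illustration. Note that the hash here only counts 20-mers within one sequence; checking against other sequences would mean adding their 20-mers to the same hash, or using the BLAST route.

```perl
use strict;
use warnings;

my $K = 20;

# Count every overlapping k-mer in a sequence: k-mer => number of occurrences.
sub count_kmers {
    my ($seq, $k) = @_;
    my %count;
    $seq = uc $seq;
    for my $i (0 .. length($seq) - $k) {
        $count{ substr($seq, $i, $k) }++;
    }
    return \%count;
}

# Emit each k-mer as its own FASTA record with a unique, position-based ID.
sub kmers_to_fasta {
    my ($seq, $k, $prefix) = @_;
    my $fasta = '';
    for my $i (0 .. length($seq) - $k) {
        $fasta .= sprintf(">%s_%d\n%s\n", $prefix, $i + 1, substr($seq, $i, $k));
    }
    return $fasta;
}

# Toy sequence invented for illustration.
my $seq    = 'ACGTACGTACGTACGTACGTACGTTTGACCA';
my $counts = count_kmers($seq, $K);
my @unique = grep { $counts->{$_} == 1 } sort keys %$counts;
print scalar(@unique), " of ", scalar(keys %$counts), " distinct ${K}-mers occur once\n";

# FASTA records for the first few windows, ready to use as BLAST queries.
print kmers_to_fasta(substr($seq, 0, 22), $K, 'mer');
```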
Re^3: Parsing BLAST
by Anonymous Monk on Apr 25, 2006 at 09:03 UTC
    Is this homework?
      Any suggestions on implementing the hashing method, or web sites with code I might be able to use/modify? This is part of a class project for the bioinformatics class I'm in. The rest of my classmates and I (seven of us) are all trying to figure this out. The professor has given us some leads, but the code he gave us isn't working right. Thanks! -Rob
        Perhaps he gave it to you like that so you could learn how to read and debug the code? The code tutorials will really help you if you stop, breathe, and then take the time to go through them, understand them, and then use them.

        The Monks don't usually do your homework for you - it's a point of principle that having your homework done for you doesn't help you learn the language. I'm going to give you some pointers on how you might tackle the problem - it's up to you to do something with them. Or not.

        I could structure it something like this
        1. Create a hash of all possible 20mers
        a. Start by making an array containing four strings A,T,G,C
        b. Count the number of array elements you have
        c. For each array element use shift to get it from the left side of the array
        d. add each of the four nucleotides to the shifted element
        e. add each new string back into the right side of the array with push
        f. repeat for each of the original elements in the array
        g. You should end up with 4^20 array elements - about 1.0995e12
        h. Use each array element as a hash key and set the value of the key to zero
        i. Thinking about it, the size of the array will get pretty large, so maybe start with four arrays, each containing a nucleotide. This cuts the final size of each individual array to a quarter. You can break it down even further by creating more arrays earlier, such as individual arrays for the first 64 combinations (3-mers), and then carry on from there. Play with it and see what works best.
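A minimal sketch of steps a to h, run with a deliberately small k. The name `all_kmers` is my own. With k=20 the same loop would need about 1.1e12 strings, which is unlikely to fit in memory on an ordinary machine, so treat this as a way to understand the shift/push idea rather than something to run at full size.

```perl
use strict;
use warnings;

# Build every possible k-mer by repeatedly extending a queue of shorter strings.
sub all_kmers {
    my ($k) = @_;
    my @bases = qw(A T G C);
    my @queue = @bases;                     # step a: start with the four 1-mers
    for my $round (2 .. $k) {
        my $n = @queue;                     # step b: count the current elements
        for (1 .. $n) {                     # step f: repeat for each of them
            my $prefix = shift @queue;      # step c: take one from the left
            # steps d and e: append each nucleotide, push onto the right
            push @queue, map { $prefix . $_ } @bases;
        }
    }
    return @queue;
}

my $k    = 3;                               # small on purpose
my %seen = map { $_ => 0 } all_kmers($k);   # step h: every key starts at zero
printf "%d keys for k=%d (4**%d = %d)\n", scalar(keys %seen), $k, $k, 4**$k;
```

For k=3 this prints `64 keys for k=3 (4**3 = 64)`. A common alternative is to skip the enumeration entirely and only create hash keys for windows that actually occur in the sequences.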

        2. Read the files in from your directory:
        a. Read a directory of file names
        b. For each file, grab the sequence and the name
        c. close the file
        d. Process the sequence and the file before starting the next one
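Step 2 might look something like this. It is a sketch under a few assumptions: one sequence per file, FASTA format, and file names ending in `.fa` or `.fasta`. The helper names `fasta_files` and `read_fasta` are made up.

```perl
use strict;
use warnings;

# List the FASTA files in a directory (the extensions are an assumption).
sub fasta_files {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    my @names = grep { /\.(?:fa|fasta)$/i } readdir $dh;
    closedir $dh;
    return map { $dir . '/' . $_ } sort @names;
}

# Read the first record of a FASTA file; returns (name, sequence) with the
# sequence lines joined into one uppercase string.
sub read_fasta {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my ($name, $seq) = (undef, '');
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;                  # tolerate Windows line endings
        if ($line =~ /^>\s*(\S*)/) {
            last if defined $name;         # stop at a second record
            $name = $1;
        }
        else {
            $line =~ s/\s+//g;
            $seq .= uc $line;
        }
    }
    close $fh;
    return ($name // '', $seq);
}

my $dir = shift(@ARGV) // '.';
for my $file (fasta_files($dir)) {
    my ($name, $seq) = read_fasta($file);
    # Process the sequence here before moving on to the next file.
    printf "%s\t%s\t%d bases\n", $file, $name, length $seq;
}
```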

        3. Process the file as follows:
        a. Make the sequence one long concatenated string
        b. You know you want to look at a window of 20 bases, so you have to decide how many bases you want to step down the sequence each time, e.g. read the first 20-base window, step down 5 bases, read the next 20-base window, and so on
        c. For each window, match the window to a hash key and autoincrement the value of the hash key
        d. If you run out of sequence, end the processing
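The windowing in step 3 could be sketched like this. The name `count_windows` and the toy sequence are assumptions. Here the hash keys are created as windows are seen instead of being pre-filled, and a step of 1 is probably what you want if every 20-mer has to be checked, since a larger step skips windows.

```perl
use strict;
use warnings;

# Slide a window of $size bases along $seq, moving $step bases each time,
# and increment the count for every window read. An existing hash ref can
# be passed in as $count to keep adding to it across sequences.
sub count_windows {
    my ($seq, $size, $step, $count) = @_;
    $count ||= {};
    for (my $pos = 0; $pos + $size <= length($seq); $pos += $step) {
        $count->{ substr($seq, $pos, $size) }++;
    }
    return $count;   # the loop ends when we run out of sequence
}

my $seq  = 'ACGTACGTAC';                  # toy sequence, window shortened to 4
my $hits = count_windows($seq, 4, 2);
print "$_\t$hits->{$_}\n" for sort keys %$hits;
```

With a window of 4 and a step of 2 this reads ACGT, GTAC, ACGT, GTAC, so it prints a count of 2 for each.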

        4. Reporting on the matches
        a. Use the hash to find keys with a value of 0, 1, 2, 3, 4, etc.
        b. You have the sequence name, so print the output as sequence name, patterns with 0 hits, patterns with 1 hit and so on. If you're only interested in single hits for that sequence, then only print those out.
        c. If you use tabs between each value, you can open it in excel as tab delimited text.
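One possible shape for the report in step 4, as a sketch. The column layout, the helper names `group_by_hits` and `report_lines`, and the counts are all invented for illustration.

```perl
use strict;
use warnings;

# Group patterns by how many times they were seen: hit count => [patterns].
sub group_by_hits {
    my ($count) = @_;
    my %by_hits;
    push @{ $by_hits{ $count->{$_} } }, $_ for sort keys %$count;
    return \%by_hits;
}

# One tab-delimited line per hit count:
# sequence name, hit count, number of patterns, comma-separated patterns.
sub report_lines {
    my ($name, $count) = @_;
    my $by_hits = group_by_hits($count);
    my @lines;
    for my $n (sort { $a <=> $b } keys %$by_hits) {
        my @patterns = @{ $by_hits->{$n} };
        push @lines, join("\t", $name, $n, scalar(@patterns), join(',', @patterns));
    }
    return @lines;
}

my %count = (ACGT => 1, CGTA => 2, GTAC => 1, TTTT => 0);   # made-up counts
print join("\t", qw(sequence hits n_patterns patterns)), "\n";
print "$_\n" for report_lines('seq1', \%count);
```

To keep only single hits, filter with `grep { $count{$_} == 1 } keys %count` before printing.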
        http://www.perlmonks.com/?node_id=9073

        This is a fairly straightforward project - really. You should be able to figure it out with the first five chapters of Merlyn's Learning Perl book, which is pretty compact.
        Good luck

        MadraghRua
        yet another biologist hacking perl....

        Question: is there a program that will search all my out files for just the ones that have exact matches? That might just solve my problem. Thanks!