Perhaps he gave it to you like that so you could learn how to read and debug the code? The code tutorials will really help you if you stop, breathe and then take the time to go through, understand and then use them.

The Monks don't usually do your homework for you - it's a point of principle that having someone else do your homework doesn't help you learn the language. I'm going to give you some pointers on how you might tackle the problem - it's up to you to do something with them. Or not.

I could structure it something like this:
1. Create a hash of all possible 20mers
a. Start by making an array containing four strings A,T,G,C
b. Count the number of array elements you have
c. For each array element use shift to get it from the left side of the array
d. add each of the four nucleotides to the shifted element
e. add each new string back into the right side of the array with push
f. repeat for each of the original elements in the array, then repeat the whole pass until the strings are 20 bases long
g. You should end up with 4^20 array elements, about 1.0995e12
h. Use each array element as a hash key and set the value of the key to zero
i. Thinking about it, the size of the array will get pretty large, so maybe start with four arrays, each containing a nucleotide. Each array then ends up a quarter of the size of the single big one. You can break it down even further by creating more arrays earlier, such as individual arrays for the first 64 combinations (3mers), and then carry on from there. Play with it and see what works best.
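
Here is a rough sketch of the shift/push idea in step 1, in case it helps to see it written down. The sub name kmers_of_length is my own invention, and I run it with a length of 3 rather than 20 on purpose: 4^20 is roughly 1.1e12 strings, which is far more than will fit in memory on an ordinary machine, so treat this as a way to see the pattern, not as the finished program.

```perl
use strict;
use warnings;

# Build every string of length $k over A,T,G,C with the shift/push queue
# from step 1. kmers_of_length is a hypothetical name, not from any module.
sub kmers_of_length {
    my ($k) = @_;
    my @bases = qw(A T G C);
    my @queue = @bases;                    # step a: start with the four 1-mers
    for my $len (2 .. $k) {                # one pass per extra base
        my $count = scalar @queue;         # step b: how many elements to extend
        for (1 .. $count) {
            my $prefix = shift @queue;     # step c: take one from the left
            # steps d and e: add each nucleotide, push results on the right
            push @queue, map { $prefix . $_ } @bases;
        }
    }
    return @queue;
}

# step h: every pattern becomes a hash key with a count of zero
my %count = map { $_ => 0 } kmers_of_length(3);
printf "%d keys\n", scalar keys %count;
```

Try it with 3, 4, 5 and so on and watch how quickly the number of keys grows before deciding how to split the work up.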

2. Read the files in from your directory:
a. Read a directory of file names
b. For each file, grab the sequence and the name
c. close the file
d. Process the sequence and the file before starting the next one
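
A minimal sketch of step 2 might look like the following. I am assuming each file holds a single FASTA record (a ">" header line followed by sequence lines); if your files are in another format the parsing will need to change. read_fasta is a made-up helper name, and the directory comes from the command line.

```perl
use strict;
use warnings;

# Read one file and return (name, sequence).
# Assumption: one FASTA record per file. read_fasta is a hypothetical helper.
sub read_fasta {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my ($name, $seq) = ('', '');
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;                  # tolerate Windows line endings
        if ($line =~ /^>\s*(\S+)/) {
            $name = $1;                    # first word of the header is the name
        }
        else {
            $line =~ s/\s+//g;             # drop any stray whitespace
            $seq .= uc $line;              # build one long upper-case string
        }
    }
    close($fh);
    return ($name, $seq);
}

# Only runs when a directory is given, e.g.  perl script.pl my_sequences
if (@ARGV) {
    my $dir = $ARGV[0];
    opendir(my $dh, $dir) or die "Cannot read $dir: $!";
    my @files = sort grep { -f "$dir/$_" } readdir($dh);
    closedir($dh);
    for my $file (@files) {
        my ($name, $seq) = read_fasta("$dir/$file");
        printf "%s\t%d bases\n", $name, length($seq);
    }
}
```

Each file is opened, read and closed before the next one is touched, which keeps only one sequence in memory at a time.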

3. Process the file as follows:
a. Make the sequence one long concatenated string
b. You know you want to look at a window of 20 bases, so you have to decide how many bases you want to walk down the sequence each time, e.g. read the first 20 base window, step down 5 bases, read the next 20 base window and so on
c. For each window, match the window to a hash key and autoincrement the value of the hash key
d. If you run out of sequence, end the processing
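
The window walk in step 3 could be sketched like this. Note one deliberate difference from the outline: instead of matching against a hash of every possible pattern, this version only creates a key the first time a window is seen, so patterns with zero hits simply never appear in the hash. count_windows is a hypothetical name, and the toy sequence and window size are only there to keep the demo short.

```perl
use strict;
use warnings;

# Count how often each window of $window bases occurs, moving $step bases
# each time. count_windows is a hypothetical helper, not from any module.
sub count_windows {
    my ($seq, $window, $step) = @_;
    my %hits;
    # step d: stop as soon as a full window no longer fits in the sequence
    for (my $pos = 0; $pos + $window <= length($seq); $pos += $step) {
        my $pattern = substr($seq, $pos, $window);
        $hits{$pattern}++;                 # step c: autoincrement the key
    }
    return \%hits;
}

# Toy run: window of 4, step of 4. Use 20 and your chosen step for real data.
my $hits = count_windows('ATGCATGCATGC', 4, 4);
print "$_\t$hits->{$_}\n" for sort keys %$hits;
```

Counting only what you actually see keeps the hash small, at the price of not listing the patterns that never occur.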

4. Reporting on the matches
a. Use the hash to find keys with a value of 0, 1, 2, 3, 4, etc.
b. You have the sequence name, so print the output as sequence name, patterns with 0 hits, patterns with 1 hit and so on. If you're only interested in single hits for that sequence, then only print those out.
c. If you use tabs between each value, you can open it in Excel as tab-delimited text.
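
For the report in step 4, something along these lines would do. I am reading "patterns with 0 hits, patterns with 1 hit" as the number of patterns at each hit count, which is my assumption; printing the patterns themselves would be a small change. report_line and the sample hash are invented for the illustration.

```perl
use strict;
use warnings;

# Build one tab-delimited row: sequence name, then how many patterns had
# 0 hits, 1 hit, ... up to $max hits. report_line is a hypothetical helper.
sub report_line {
    my ($name, $hits, $max) = @_;
    my @totals = (0) x ($max + 1);         # one slot per hit count
    for my $n (values %$hits) {
        $totals[$n]++ if $n <= $max;       # counts above $max are ignored
    }
    return join("\t", $name, @totals) . "\n";
}

# Made-up data purely for the demo
my %sample = (AAAA => 0, ATGC => 3, GGCC => 1, TTAA => 1);
print join("\t", 'name', map { "${_}_hits" } 0 .. 3), "\n";
print report_line('seq1', \%sample, 3);
```

Redirect the output to a file and it should open as tab-delimited text in a spreadsheet.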

http://www.perlmonks.com/?node_id=9073

This is a fairly straightforward project - really. You should be able to figure it out with the first five chapters of Merlyn's Learning Perl book, which is pretty compact.

Good luck

MadraghRua
yet another biologist hacking perl....


In reply to Re^5: Parsing BLAST by MadraghRua
in thread Parsing BLAST by cumurph
