in reply to unique sequences

Hi Anonymous Monk,

it is difficult to guess what you should be getting without seeing the input, but the output you get is consistent with the code you've shown. Your code basically discards the "comments" (that's what I call the lines starting with >, for lack of a better description) and then looks for sequences of ten nucleotides (I hope this is the right term) followed by GG. And that's pretty much what you have in your output. So, as far as I can tell, you get what you ask for.
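To make sure we are talking about the same thing, here is a minimal sketch of the behaviour I am describing. It is not your code: the sub name ten_mers_before_gg and the sample lines are made up for illustration, and I am assuming the sequence data uses only the upper-case letters A, C, G and T.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every run of ten nucleotides that is
# immediately followed by GG, ignoring lines that start with '>'.
sub ten_mers_before_gg {
    my @lines = @_;
    my @found;
    for my $line (@lines) {
        next if $line =~ /^>/;                       # skip header/"comment" lines
        push @found, $line =~ /([ACGT]{10})GG/g;     # list context: all captures
    }
    return @found;
}

# Made-up sample input; the real input was not shown in the thread.
my @sample = (
    '>seq1 sample header',
    'ACGTACGTACGGTT',
    'TTTTTTTTTTGGAA',
);

print "$_\n" for ten_mers_before_gg(@sample);
```

Note that this sketch works line by line, so a match that straddles a line break would be missed, and matches do not overlap. Whether either of those matters depends on what you actually need.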

Please explain in plain English what you need to extract and in which respect the output you get is not what you want or need.

As a side note, it may or may not be relevant or important, but please remember that a hash does not preserve the order in which its entries were inserted.
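If the order does matter to you, one common idiom is to use the hash only to detect duplicates and let a list carry the order. A minimal sketch, with a made-up sub name (unique_in_order) and made-up data:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the unique items of a list,
# in the order in which each item was first seen.
sub unique_in_order {
    my %seen;
    return grep { !$seen{$_}++ } @_;   # keep an item only on its first occurrence
}

my @unique = unique_in_order(qw(TTTTTTTTTT ACGTACGTAC TTTTTTTTTT));
print "$_\n" for @unique;
```

Iterating over keys %seen instead would give the same items, but in no particular order.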

Re^2: unique sequences
by BillKSmith (Monsignor) on Dec 11, 2017 at 20:39 UTC
    The name of the filehandle suggests that the input is in FASTA format. The reference article indicates that a file may contain more than one sequence. Each sequence is preceded by a '>' line which specifies its name and may contain comments. If the input contains more than one sequence, the script combines them into one. If the file is known to contain only one sequence, it is overkill to test every line for '>'.
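    If the sequences do need to be kept apart, a rough sketch along the following lines might serve as a starting point. The sub name read_fasta_records and the sample data are invented for illustration, and I am assuming the first word after '>' is the sequence name. It uses //, so it needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: reads FASTA text from a filehandle and returns
# (\@names, \%seq_for), keeping each named sequence separate.
sub read_fasta_records {
    my ($fh) = @_;
    my (%seq_for, @names, $name);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;                 # strip Unix or Windows line ending
        next if $line eq '';
        if ($line =~ /^>\s*(\S*)/) {          # header line starts a new record
            $name = length $1 ? $1 : 'unnamed_' . (@names + 1);
            push @names, $name unless exists $seq_for{$name};
            $seq_for{$name} //= '';
        }
        elsif (defined $name) {
            $seq_for{$name} .= $line;         # join the lines of one sequence
        }
    }
    return (\@names, \%seq_for);
}

# Made-up sample, read through an in-memory filehandle.
my $fasta = ">seq1 first\nACGT\nACGT\n>seq2 second\nGGCC\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my ($names, $seq_for) = read_fasta_records($fh);
close $fh;

print "$_: $seq_for->{$_}\n" for @$names;
```

    Joining the lines of each record before searching also avoids missing a match that is split across two lines.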
    Bill