uvnew has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks. I need to present the human genome as a multi-dimensional array. The input file text looks like this:

>chromosome_1

AATTGGCC...

>chromosome_2

GGGTACGA...

.

.

>chromosome_17

AGACTTGA...

And so on. I would like each row to be one chromosome and each column to be one nucleotide (character), so that, counting from 0, $Human_genome[1][3] is 'T' and $Human_genome[16][2] is 'A' in this example. The chromosomes differ in length (about 3 billion characters in total). I have enough RAM for this task. Many thanks for any suggestions!

Re: Human genome array
by AnomalousMonk (Archbishop) on Jul 19, 2011 at 17:45 UTC

    As you have already seen from at least one of the responses to your OP, not revealing (in code or at least narrative form) the unsuccessful approaches you have already attempted to a problem is likely to result in frustration all around.

    There are many monks lurking about the place who are vastly better qualified than I to respond to your original question, were you to supply more basic info on what you want to achieve and what you have tried. I can say that the overhead of each Perl array element is many bytes (16? 32? More?) of RAM, so what is at first glance a 3 GB array is, in reality, much larger. However, tools like PDL (see also PDL::FAQ) are designed to operate on large, multi-dimensional character and integer arrays of this kind as 'raw' data. Of course, if each chromosome of the genome can be represented as a string (resulting in an array of, e.g., 46 elements), then we are back again to something like 3 GB. As always, more info will be helpful to all.
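
The one-string-per-chromosome idea above could be sketched roughly as follows. This is a minimal illustration, not anyone's posted solution: the names read_genome_as_strings and base_at are hypothetical, the sample data is a tiny in-memory stand-in for the real file, and it assumes every header line starts with '>' and that a sequence may span several lines.

```perl
use strict;
use warnings;

# Read FASTA-style text and keep each chromosome as ONE string,
# so storage stays near one byte per nucleotide.
sub read_genome_as_strings {
    my ($fh) = @_;
    my @chromosomes;
    while ( my $line = <$fh> ) {
        $line =~ s/\s+\z//;                 # strip newline (and any trailing CR)
        next if $line eq '';                # skip blank lines
        if ( $line =~ /^>/ ) {              # header line: start a new chromosome
            push @chromosomes, '';
            next;
        }
        die "sequence data before first header\n" unless @chromosomes;
        $chromosomes[-1] .= $line;          # append sequence text to current chromosome
    }
    return \@chromosomes;
}

# Hypothetical accessor that mimics $Human_genome[$row][$col] (both 0-based).
sub base_at {
    my ( $chromosomes, $row, $col ) = @_;
    return substr $chromosomes->[$row], $col, 1;
}

# Tiny in-memory sample standing in for the real genome file.
my $sample = ">chromosome_1\n\nAATTGGCC\n\n>chromosome_2\n\nGGGTACGA\n";
open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
my $strings = read_genome_as_strings($fh);
close $fh;

print base_at( $strings, 1, 3 ), "\n";      # row 1 is chromosome_2; prints T
```

The trade-off is that substr replaces the two-index array syntax, in exchange for avoiding a separate Perl scalar for every nucleotide.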

      Thank you very much for your suggestion and feedback. In the future I will definitely add my code and all the info I have to make things clearer.
Re: Human genome array
by BrowserUk (Patriarch) on Jul 20, 2011 at 00:56 UTC
    I need to present the human genome as a multi-dimensional array. ... about 3 billion characters). I have enough RAM for this task.
    1. Why do you "need" to?

      Why do you need to have the entire genome in RAM at the same time? Why do you need to store it in a 2D array?

    2. Do you realise that storing 3 billion characters each as a separate scalar in a 2D Perl array will require approximately 150 GB of RAM?
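
To make the scale of that estimate concrete, here is a back-of-the-envelope sketch. The per-element figure of 50 bytes is an assumption: it is simply what 150 GB divided by 3 billion works out to, not a measurement, and the real overhead depends on the Perl version and build.

```perl
use strict;
use warnings;

my $nucleotides      = 3_000_000_000;  # "about 3 billion characters" from the question
my $bytes_per_scalar = 50;             # ASSUMPTION: implied by 150 GB / 3 billion, not measured

my $aoa_bytes    = $nucleotides * $bytes_per_scalar;  # one Perl scalar per nucleotide
my $string_bytes = $nucleotides;                      # one byte per nucleotide inside strings

printf "array of arrays: ~%.0f GB\n", $aoa_bytes / 1e9;
printf "plain strings:   ~%.0f GB\n", $string_bytes / 1e9;
```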

Re: Human genome array
by Anonymous Monk on Jul 19, 2011 at 15:50 UTC
    1. Read in a line
    2. Split it into characters
    3. put the characters into an array
    4. push a ref to that array into the main array.
    5. Repeat until done, or out of resources.
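
The five steps above might look something like this in Perl. It is only a sketch under stated assumptions: read_genome_as_aoa is a hypothetical name, the sample is a tiny in-memory stand-in for the real file, header lines are assumed to start with '>', and a chromosome's sequence is allowed to span several lines.

```perl
use strict;
use warnings;

# Build an array of arrays: one row per chromosome, one element per nucleotide.
sub read_genome_as_aoa {
    my ($fh) = @_;
    my @genome;
    while ( my $line = <$fh> ) {            # step 1: read in a line
        $line =~ s/\s+\z//;                 # strip newline (and any trailing CR)
        next if $line eq '';                # skip blank lines
        if ( $line =~ /^>/ ) {              # header line: start a new row
            push @genome, [];               # step 4: a ref to a fresh array goes in the main array
            next;
        }
        die "sequence data before first header\n" unless @genome;
        # steps 2 and 3: split into characters and append them to the current row
        push @{ $genome[-1] }, split //, $line;
    }                                       # step 5: repeat until input is exhausted
    return \@genome;
}

# Tiny in-memory sample standing in for the real genome file.
my $sample = ">chromosome_1\n\nAATTGGCC\n\n>chromosome_2\n\nGGGTACGA\n";
open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
my $genome = read_genome_as_aoa($fh);
close $fh;

print $genome->[1][3], "\n";                # row 1 is chromosome_2; prints T
```

This gives exactly the $Human_genome[$row][$col] access asked for, but each nucleotide becomes its own Perl scalar, so it is only practical for small inputs.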

    Also, try it yourself first; don't ask us to do your work for you unless you're willing to let us cash your paycheck for you too.

      Thank you for your reply. I attempted to do it myself for several hours, and asking in this forum is the result of my lack of success in solving the problem. My broken code wouldn't help to make anything clearer. Thanks anyway.
        Post your code anyway! We can probably tell how you're trying to do things and what your incorrect assumptions are. You give a little, we give a lot back.

        I have attempted to figure out what is wrong with your code for several minutes, and am posting to inform you of my lack of success. My crystal ball couldn't help to make anything clearer.

        On the bright side, the magic 8-ball did recommend that you should try adding use strict; use warnings; at the top of the script.