"I suggest you give BioPerl a try. It might be slower, but you'll be able to do a lot with very little coding."

That statement encapsulates exactly what is wrong with BioPerl. The very essence of dealing with genome sequences is that they are huge. Individually, the algorithms used are for the most part very simple; the problem is that they must be applied to huge volumes of data, which means they take a long time unless they are crafted from the get-go to be as efficient as possible.

The standard response to concerns about efficiency, here (and more widely), is that programmer efficiency is more important than program efficiency; but this meme completely ignores the wider picture. User time is just as important as, and arguably more important than, programmer time.

The big value-add (ROI multiplier) of code is that, once constructed correctly, it can be reused over and over, by many users, at no extra cost beyond the time those users spend waiting for it to complete. It is therefore a positive investment for the programmer to spend some extra time optimising the code he produces--where that code will be re-used, and especially for algorithms and processes that will be used for large-scale processing. The extra time (and cost) the programmer spends making the algorithms efficient (once they are correct) amortises many-fold over the lifetime of that code.

"In particular, they have a Bio::Seq::LargePrimarySeq module to deal with large sequences such as the ones that you are working with. Give it a go, it's funnier than reinventing wheels poorly."

First, there is an assumption in that statement that the algorithms and implementations in the (huge; over 2000 packages) BioPerl suite are beautifully rounded wheels. My (very limited) experience of using a few of them is that not all of them are.

They may work, but they often have inconvenient interfaces--requiring glue code to allow the outputs of one module to be used as inputs to the next, even when both modules are part of the overall Bio:: suite. And they are often not written with a view to efficiency--which, given the nature and scale of the problems they are designed to address, ought to be a primary goal.

For example, for the user to obtain the actual data for a given sequence from a fasta-format file, try chasing through the number of times that sequence gets copied--from the point where it is read from the file to the point where it is actually delivered to the user for use. Much of that copying could be avoided by passing references to scalars, rather than the scalars themselves, as sketched below.
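A minimal sketch of the principle. The subroutine names here are mine, not BioPerl's, and counting GC bases just stands in for whatever per-sequence work is actually required:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Copying style: the (possibly multi-megabyte) sequence string
    # is duplicated every time it is passed in.
    sub gc_count_by_copy {
        my( $seq ) = @_;            # a full copy of the sequence is made here
        return $seq =~ tr/GCgc//;
    }

    # Reference style: only a reference (a few bytes) is passed;
    # the sequence itself is never duplicated.
    sub gc_count_by_ref {
        my( $seqRef ) = @_;         # a reference to the caller's scalar
        return $$seqRef =~ tr/GCgc//;
    }

    my $sequence = 'ACGT' x 25_000_000;        # ~100MB of sequence data

    my $gc = gc_count_by_ref( \$sequence );    # no copy made
    print "GC bases: $gc\n";

Multiply one avoided 100MB copy by the number of hand-offs between modules, and by the number of sequences in the file, and the saving is far from trivial.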

The sheer scale of the Bio:: namespace is the other problem. You said "you'll be able to do a lot with very little coding", but that only holds true once you have discovered the appropriate modules and methods to use. And given the 2000+ component modules, that is a steep learning curve for a full-time programmer--let alone your average genomic biologist, for whom the code is just a means to an end.

For example, you've suggested Bio::Seq::LargePrimarySeq, but why that module rather than, say, Bio::SeqIO::fasta or Bio::SeqIO::largefasta? And where is the breakpoint at which you should transition from using one to the other?
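For what it's worth, the call-site difference between those two parsers is tiny. A hedged sketch (the filename is made up; and it is my understanding, not a tested claim, that 'largefasta' hands back Bio::Seq::LargePrimarySeq-style, temp-file-backed sequences):

    use strict;
    use warnings;
    use Bio::SeqIO;

    my $in = Bio::SeqIO->new(
        -file   => 'input.fasta',   # hypothetical filename
        -format => 'fasta',         # or 'largefasta'; the loop below is identical
    );

    while( my $seq = $in->next_seq ) {
        printf "%s: %d bases\n", $seq->display_id, $seq->length;
    }

Nothing at the call site tells you which format to pick, nor where the crossover in memory or speed lies; you have to go digging through the documentation to find out.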

For the OP's question, which needs to deal with the sequences in the input files one at a time, the overhead of using the module you've recommended--taking those files and splitting them into a bunch of temporary files, each containing one sequence--is exactly that: unnecessary, pure overhead. Processing the records serially needs nothing more than setting the input record separator, as the sketch below shows.
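A minimal sketch of that serial approach in plain Perl--no modules, no temporary files (the filename is hypothetical, and the printf stands in for the real per-sequence processing):

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $fh, '<', 'input.fasta' or die "open: $!";

    local $/ = "\n>";                  # read one fasta record at a time
    while( my $record = <$fh> ) {
        chomp $record;                 # strip the trailing "\n>" separator
        $record =~ s/^>//;             # strip the leading '>' (first record only)
        my( $header, @lines ) = split /\n/, $record;
        my $sequence = join '', @lines;

        # process one sequence here, then let it go out of scope
        printf "%s: %d bases\n", ( split ' ', $header )[0], length $sequence;
    }

    close $fh;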

I realise that the Bio:: suite contains advanced methods for parallelising (clustering) solutions, and that for labs that deal with large-scale genomic processing on a regular basis, the investment and set-up costs of utilising those techniques will be far more cost-effective than optimising the serial processing. But for many of the people working in this area, those setup and hardware investments are a non-starter. They have to do what they need to do with whatever equipment they have to hand. And the more efficiently they can process their data, the less time they spend waiting to see whether their experiments are achieving results--allowing them to try more ideas and variations.

For well-known processing--payroll, inventory, etc.--of large data volumes done on a regular basis, it is usually possible to schedule that processing to run in the dead time of night or over weekends, and whether it takes 2 hours or 20 doesn't really matter. But for the kind of experimental--what if I try this; or this; or this?--investigations that typify the work of small players in new fields like genomics, the time taken to see the results of each investigation often comes straight out of the overall time allotment. And time spent waiting--whether to work out which combination of modules and methods to use, or for the results to be produced--is just dead time.

The bottom line is that if the basic algorithms and implementations are written to operate efficiently, they will also save time (and money; e.g. hardware costs) when used in clustered/parallel environments. So the failure to optimise the core algorithms and implementations of complex frameworks is not hubris and laziness in the Perlish, good sense of those terms, but rather arrogance and negligence!


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
