in reply to Re: Lower-casing Substrings and Iterating Two Files together
in thread Lower-casing Substrings and Iterating Two Files together

  I suggest you give BioPerl a try. It might be slower, but you'll be able to do a lot with very little coding.

That statement encapsulates exactly what is wrong with BioPerl. The very essence of dealing with Genome Sequences is that they are huge. Individually, the algorithms used are for the most part very simple. The problem is that they need to be applied to huge volumes of data, and that means that they take a long time unless they are crafted from the get-go to be as efficient as possible.

The standard response to concerns about efficiency here (and more widely) is that programmer efficiency is more important than program efficiency, but this meme completely ignores the wider picture. User time is just as important as, and arguably more important than, programmer time.

The big value-add (ROI multiplier) of code is that, once constructed correctly, it can be reused over and over, by many users, for no extra cost beyond the time those users spend waiting for it to complete. It is therefore a positive investment for the programmer to spend some extra time optimising the code he produces--where that code will be re-used, and especially for algorithms and processes that will be applied to large-scale processing. The extra time (and cost) the programmer spends making the algorithms efficient (once they are correct) amortises many-fold over the lifetime of that code.

  In particular, they have a Bio::Seq::LargePrimarySeq module to deal with large sequences such as the ones that you are working with. Give it a go, it's funnier than reinventing wheels poorly.

First, there is an assumption in that statement that the algorithms and implementations in the (huge; over 2000 packages) BioPerl suite are beautifully rounded wheels. My (very limited) experience of using a few of them is that not all of them are.

They may work, but they often seem to have inconvenient interfaces--requiring glue code to allow the outputs of one module to be used as inputs to the next, even when both modules are part of the overall Bio suite. And they are often not written with a view to efficiency--which, given the nature and scale of the problems they are designed to address, should be a primary goal.

For example, for the user to obtain the actual data for a given sequence from a fasta format file, try chasing through the number of times that sequence gets copied--from the point where it is read from the file, to the point where it is actually delivered to the user for use. Much of that copying could be avoided by passing references to scalars rather than the scalars themselves.
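To illustrate (a hypothetical pair of subroutines, not BioPerl's actual call chain), the difference between the two styles is roughly this:

# Copying: the whole (possibly multi-megabyte) sequence string is duplicated
# on the way in, and the caller copies the result again on the way out.
sub strip_header_copy {
    my( $seq ) = @_;                 # copy No.1
    $seq =~ s/^>[^\n]*\n//;
    return $seq;                     # copy No.2 when assigned by the caller
}

# Passing a reference: only a few bytes change hands; the sequence is
# modified in place and never duplicated.
sub strip_header_ref {
    my( $seqRef ) = @_;              # just a reference
    $$seqRef =~ s/^>[^\n]*\n//;
    return;
}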

The sheer scale of the Bio:: namespace is the other problem. You said: "you'll be able to do a lot with very little coding", but that only holds true once you have discovered the appropriate modules and methods to use. And given the 2000+ component modules, that is a steep learning curve for a full-time programmer--let alone your average genomic biologist, for whom the code is just a means to an end.

For example, you've suggested Bio::Seq::LargePrimarySeq, but why that module rather than, say, Bio::SeqIO::fasta or Bio::SeqIO::largefasta? And where is the breakpoint at which you should transition from using one to the other?

For the OP's question, which needs to deal with the sequences in the input files one at a time, the overhead of using the module you've recommended--taking those files and splitting them into a bunch of temporary files, each containing one sequence--is exactly that: unnecessary, pure overhead.

I realise that the Bio suite contains advanced methods for parallelising (clustering) solutions, and that for labs who deal with large scale genomic processing on a regular basis, the investment and set-up costs to allow the utilisation of those techniques will be far more cost effective than optimising the serial processing. But for many of the people working in this area, those setup and hardware investments are a non-starter. They have to do what they need to do, with whatever equipment they have to hand. And the more efficiently they can process their data, the less time they'll spend waiting to see if their experiments are achieving results, allowing them to try more ideas and variations.

For well-known processing--payroll, inventory etc.--of large scale data volumes done on a regular basis, it is usually possible to schedule that processing to run in the dead time of night, or over weekends, and whether it takes 2 hours or 20 doesn't really matter. But for the kind of experimental--what if I try this; or this; or this; or this;--investigations that typify the work of small players in new fields like genomics, the time taken to see the results of each investigation often comes straight out of the overall time allotment. And time spent waiting--whether trying to work out what combination of modules and methods to use; or for the results to be produced--is just dead time.

The bottom line is that if the basic algorithms and implementations are written to operate efficiently, they will also save time (and money; e.g. hardware costs) when used in clustered/parallel environments. So the failure to optimise the core algorithms and implementations of complex frameworks is not hubris and laziness in the Perlish good-sense of those terms, but rather arrogance and negligence!


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^3: Lower-casing Substrings and Iterating Two Files together
by bruno (Friar) on Apr 27, 2009 at 20:14 UTC
    There is little to argue about; most of what you've said I agree with.

    I agree, for example, that BioPerl is not a perfectly rounded wheel. It has rough edges and some parts could use some love. But I also think that the solution is not to turn your back on it on the basis of poor performance or interface clunkiness alone.

    My general advice for anyone doing bioinformatics in Perl is: learn BioPerl. It's true that it's big, but that's because biology is big. But you don't have to know it all before you start using it, so the cognitive overhead that you talk about is really not that big.

    If you have a problem and, after a quick search, you find that there's a BioPerl module that solves it, chances are that it will be your last stop: the solution will most probably be correct and tested. But if you have performance issues, then you can start thinking of profiling and looking for alternative solutions, among which is contributing to the BioPerl module in question to try to make it faster.

    That concept above is (and I think that you'll agree with me here) essential for any open source project to move forward.

    As you may know, most of us doing research in computational biology are not programmers by training, and sometimes implement less than optimal algorithms or have poor programming style; this is evident when looking at BioPerl's code. We are, however, open to anyone (biologists or not) willing to make contributions and improvements to the code; many of the tasks in this area don't require any bio-knowledge.

      But I also think that the solution is not to turn your back on it on the basis of poor performance or interface clunkiness alone.

      The problems with BioPerl go so much deeper than just "poor performance" or "interface clunkiness".

      • Try abysmal performance:

        The last time I tried Bio::SeqIO::fasta, it was three orders of magnitude slower than 4 lines of Perl that achieved the desired result: read a Fasta format file and return the identifiers and sequences sequentially.
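        For the record, that hand-rolled approach is roughly this (a sketch of the idea, with a placeholder filename; not the original 4 lines):

        local $/ = "\n>";                              # '>' starts each new Fasta record
        open my $fh, '<', 'sequences.fasta' or die $!;
        while ( my $rec = <$fh> ) {
            chomp $rec;  $rec =~ s/^>//;               # strip the record separators
            my( $id, $seq ) = split /\n/, $rec, 2;     # header line, then the sequence
            $seq =~ tr/\n//d;                          # splice the sequence lines together
            ## ... use $id and $seq here ...
        }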

        I would have run a new benchmark to see if it had improved any in the intervening couple of years, but that requires downloading and installing 30MB of stuff spread across 2000 files. Never mind the quality--feel the width.

        "So what?", you are probably saying. Disk space is cheap. True, but mind-space isn't! Here's just one of the ways installing that lot adversely affects me. The next time I re-build the index to my Perl documentation, that lot will get added--and triple its size. And as it appears early in the alphabet, it means that I will have to labour my way past all of it almost every time I want to look up something useful. Small potatoes you might think, but the last time I made the mistake of installing it, I ended up blowing away my whole installation to get rid of it, because it was like carrying a heavy raincoat for a walk on the beach. Dead weight.

      • Try "clumsy", "hideous" or "tortuous" interface.

        Like trying to drive a car using a keyboard interface: unintuitive, laborious and verbose. The triumph of OO-spaghetti (which is far, far worse than procedural spaghetti) over practicality. A clue to some of the problems is the repetition of things starting with "seq" in these synopses:

        $sequence = $seqIO->next_seq();    # Fetch the next sequence from the stream.
        $seqIO->write_seq($sequence [,$another_sequence,...]);

        Can a seqIO object read or write anything other than a sequence? If not, why not just $seqIO->next & $seqIO->write?

        I see from the latest docs that it now (I don't recall seeing it before) sports a tied interface, but the comments in the description sum up the attitude:

        # The SeqIO system does have a filehandle binding. Most people find this
        # a little confusing,

        The patronising attitude that you must somehow hide Perl from users. The idea that somehow this:

        while ( my $seq = $in->next_seq() ) {
            print "Sequence ", $seq->id, " first 10 bases ", $seq->subseq(1,10), "\n";
        }

        is better than this:

        while( my( $id, $seq ) = <$tiedFasta> ) {
            say "Sequence $id first 10 bases ", substr( $$seq, 0, 10 );
        }

        Perl works! That's why we use it in preference to other languages. In particular, Perl's file and string handling works far better than Java's. So why wrap over Perl's strengths with Java-wanna-be OO wrappers? OO done well can be a productivity aid, but done badly (which is most of the time), it means that substr is called subseq in one context, and subString in another context; and sub_string in yet another context; and ...

        And all of those aliases have to be looked up individually, and are poor substitutes for the real thing: e.g. no lvalue context or 4-arg variants. That's a problem, because if you want to operate on subsequences, using substr you can do things like:

        substr( $$seq, $_, 12 ) =~ m[...] for 0 .. length( $$seq ) - 12;

        to examine (or modify) all 108,000 12-char subsequences in the first sequence of Drosophila without any copying! Do the same thing with the subseq method and you'll end up having to copy 1.2 MB of data--for the first sequence alone. And that's just for read-only access.

        Apply this same process to all 164MB of Drosophila and you'll end up copying 1,921,667,736 bytes of data in 12-byte chunks--when there is no need to copy any!
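      Modification--the OP's actual task of lower-casing regions--is just as direct with lvalue or 4-arg substr, and again nothing beyond the region touched ever gets copied (the offsets here are made up for illustration; $seq holds a reference to the sequence, as above):

      substr( $$seq, 5_000, 1_000 ) =~ tr/ACGTN/acgtn/;                  # lvalue substr: in place

      substr( $$seq, 5_000, 1_000, lc substr( $$seq, 5_000, 1_000 ) );   # 4-arg form: only the 1,000-char slice is copied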

      And the afterthought tied interface provided doesn't help much, because it's just a wrap-over of the OO interface, which means it does even more copying and is even slower.
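      A tied filehandle that works with Perl, rather than wrapping the OO layer, needs little more than the hand-rolled parsing shown earlier wrapped in a tie. A rough sketch (hypothetical package name and filename, no error handling) of the sort of thing the <$tiedFasta> example assumes:

      package Tie::Fasta;                      # hypothetical; not a CPAN or BioPerl module

      sub TIEHANDLE {
          my( $class, $file ) = @_;
          open my $fh, '<', $file or die "open '$file': $!";
          return bless { fh => $fh }, $class;
      }

      # One record per call: ( $id, \$sequence ), or the empty list at end of file.
      sub READLINE {
          my( $self ) = @_;
          local $/ = "\n>";
          defined( my $rec = readline $self->{fh} ) or return;
          chomp $rec;  $rec =~ s/^>//;
          my( $id, $seq ) = split /\n/, $rec, 2;
          $seq =~ tr/\n//d;
          return ( $id, \$seq );               # hand back a reference: the sequence is not copied
      }

      sub CLOSE { close $_[0]{fh} }

      package main;
      use Symbol 'gensym';

      my $tiedFasta = gensym();
      tie *$tiedFasta, 'Tie::Fasta', 'dmel.fasta';     # placeholder filename

      while ( my( $id, $seq ) = <$tiedFasta> ) {
          print "Sequence $id first 10 bases ", substr( $$seq, 0, 10 ), "\n";
      }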

      And these are just a few of the problems with one tiny part of this behemoth. And they are endemic. O'Woe engineering built atop performance-doesn't-matter architecture.

      There is simply no logic to wrapping 7 layers of OO over Perl's powerful built-ins in order to read and write simple file formats. But that horse long since flew the coupé :)

      We are, however, open to anyone (biologists or not) willing to make contributions and improvements to the code; much of the tasks in this area don't require any bio-knowledge.

      As Pat said, when asked for directions: "Ah now! If I were you, trying to get there, I wouldn't be starting from here."

      The problem is that the problems run so deep that you cannot patch-fix the implementation whilst leaving the architecture and interfaces intact. And any attempt by a non-biologist outsider to suggest changing the architecture, interfaces and implementation would be like an Englishman suggesting the US change its gun laws. It just ain't gonna happen. All the considerations of backward compatibility and installed base, compounded by vested interests, long-standing contributions and NIH.

      About the best thing that could be done is to go for a Bio::Lite: a few small modules with minimal interfaces, optimised to work with rather than against Perl's native abilities.

      1. Half a dozen PerlIO layers for reading and writing the basic file formats.
      2. A few genome-tailored regex generators to simplify searching and fuzzy-matching the basic ACGT & extended ACGTNXacgt sequence formats (a rough sketch follows this list).
      3. A couple of wrap-overs of one of the Math::* modules to provide the more commonly used statistical tools.
      4. And most importantly, some tailored, worked examples of using some of Perl's more esoteric built-in facilities--like pack/unpack and bit-wise string operations to perform the more common manipulations.
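      To make item 2 concrete, here is a throwaway sketch (a hypothetical helper, with only a handful of IUPAC codes shown and made-up sample data) of the kind of regex generator meant:

      # Turn a short motif written with IUPAC ambiguity codes into a compiled,
      # case-insensitive regex, so lower-cased (masked) acgt regions still match.
      my %iupac = (
          A => 'A', C => 'C', G => 'G', T => 'T',
          R => '[AG]', Y => '[CT]', W => '[AT]', S => '[CG]', N => '[ACGT]',
      );

      sub motif2regex {
          my( $motif ) = @_;
          my $pat = join '', map { $iupac{ uc $_ } // quotemeta $_ } split //, $motif;
          return qr/$pat/i;
      }

      # Usage: report every hit of a made-up motif in a made-up sequence (a reference,
      # as in the substr examples above).
      my $seq = \'TTGGATAACCAAggttaaccTT';
      my $re  = motif2regex( 'GGWTAAN' );
      while ( $$seq =~ /$re/g ) {
          printf "match '%s' at offset %d\n", $&, $-[0];
      }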

      If anyone were to start that project, you could count me in, as I think that Genome research is one of the most worthwhile areas of open source development around. But it would require someone with a decent understanding of the field to head up such a project, otherwise you'd just end up with another programmer's-view development, instead of a user's-needs-driven one. And that would benefit no one.

      In theory, Perl 6/Parrot would make a good basis for a new Bio-project, bringing the best of OO, functional, and perhaps even parallelism to the table to provide for a clean and efficient solution. But even then, it would be a surprise to me if anything more was done than just to re-implement the existing Bio-Perl interfaces as quickly and directly as possible, without spending any time exploring how the new language and facilities it provides could be best utilised.

      It will probably take until BioPerl6-II for people to become sufficiently familiar with Perl 6 to begin to see the possibilities--and by then, 2 or 3 Christmases after the Christmas delivery of Perl 6, the existing interfaces will be too ingrained, and the installed base too large, to consider radical changes. And it would take radical changes to address the problems.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        In theory, Perl 6/Parrot would make a good basis for a new Bio-project, bringing the best of OO, functional, and perhaps even parallelism to the table to provide for a clean and efficient solution. But even then, it would be a surprise to me if anything more was done than just to re-implement the existing Bio-Perl interfaces as quickly and directly as possible, without spending any time exploring how the new language and facilities it provides could be best utilised.

        BioPerl has already started its move to Moose (BioMoose) as a first step towards BioPerl6. But after reading the seminal posts on the BioPerl mailing list, I'm afraid that you are right in your predictions. Sadly, I don't expect more than a porting re-implementation.

        citromatik