You can learn a bit of bioinformatics at Rosalind. By solving problems, you not only gain XP and compete with others, but also learn about the underlying science.
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Re^4: genetic algorithm for motif finding
by BrowserUk (Patriarch) on Aug 14, 2013 at 09:45 UTC

    I took a look at 2 or 3 of the problems in the "bioinformatics stronghold" section -- I picked those that seemed to have the lowest number of solutions, assuming they would be the hardest problems -- and for those I tried, the solutions were trivial, usually simple one-liners.

    I see little incentive to (re-)solve such simple problems; nor did I really learn that much about the terminology from doing them. In the end, I'm never going to be a bioinformatician -- no real access to real world data or problems -- so my becoming conversant in the terminology doesn't really benefit anyone.

    My interest is solely the computational and algorithmic challenges that the field presents, and for that I just need the problems described in terms I understand; and access to the (real or example) datasets.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^4: genetic algorithm for motif finding
by BrowserUk (Patriarch) on Aug 15, 2013 at 16:43 UTC

    By /msg you said:

    Re Re^4: genetic algorithm for motif finding The problems start with the simple ones, you have to solve them to get further. I found for example EDTA or RNAS to be a bit more complicated.

    I have no idea if you have anything to do with the Rosalind site; but if you do, please do not be offended by this. It's just my opinion :) Your attempt to help me is much appreciated.

    The problem I see with the Rosalind site is this: the tutorials and challenges are all geared to leading the programmer to solve the problems in one particular way. The two examples you cite are edit distance and graphs respectively, both of which are a problem.

    • Edit distance.

      There are many different algorithms for this (Levenshtein, Wagner-Fischer, etc.) and they are all horribly inefficient, O(mn), compared to simply XORing the strings and counting the nulls:

      $n = ( $a ^ $b ) =~ tr[\0][];   # counts NUL bytes, i.e. positions where $a and $b match

      Which is O(2N). And as both passes are implemented in C (opcodes), the results can be orders of magnitude faster for DNA-length strings.

    • Graphs.

      Unless a new graph module has appeared on the scene (CPAN) recently, implementing graph algorithms in Perl is horribly slow and hugely memory hungry.

      Graphs lend themselves to being implemented OO-style, with Nodes/Edges/AdjacencyMatrix/Attributes/Weights/etc. classes all based around blessed hashes (or worse!), which renders even the simplest of graphs constructed with them huge, cumbersome and slow.

      The idea of solving genomic problems by creating graphs of entire genes with every base a node and edges linking the pairs is just a non-starter in Perl using native Perl graphing libraries. They just require too much memory and processor.

      That is probably why so much of genomic work is parceled up and sent off to mainframes or clusters (BLAST servers and the like) with gobs of memory and huge processing power, to do the donkey work. But that in turn creates its own set of knock-on problems:

      1. These batch processors tend to produce far more information than most querents require: say, producing edit distances and other quality statistics when often only a boolean yes or no is required.
      2. Their output is, often as not, in a form -- multi-line 'carded' records and the like -- which makes extracting the required results almost as complex as solving the original problem.
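
    The XOR idiom from the edit-distance bullet above can be sketched as follows. This is only a minimal illustration: the sub name xor_counts and the sample strings are made up for the example, and it assumes plain byte strings of equal length. Note that counting the NUL bytes gives the number of matching positions; the mismatches are whatever is left over.

```perl
use strict;
use warnings;

# Count the positions where two equal-length strings agree or differ.
# XORing two strings yields a NUL byte wherever the characters match.
# (xor_counts is a hypothetical helper name, not from the original post.)
sub xor_counts {
    my ($s, $t) = @_;
    die "strings must be the same length\n" unless length($s) == length($t);
    my $x    = $s ^ $t;               # bytewise string XOR
    my $same = ($x =~ tr/\0//);       # NUL bytes: matching positions
    my $diff = length($x) - $same;    # everything else: mismatches
    return ($same, $diff);
}

my ($same, $diff) = xor_counts('GAGCCTACTAACGGGAT', 'CATCGTAATGACGGCCT');
print "same=$same diff=$diff\n";      # prints "same=10 diff=7"
```

    When the operands differ in length, Perl's string XOR behaves as if the shorter one were padded with zero bits, so the sketch refuses that case rather than guessing what the caller meant.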

    The silly thing is that x86-64 class machines are actually very good at string processing, and many of the tasks can be tackled very efficiently using them once you stop viewing the problems in terms of graph theory and look for string manipulation solutions.

    That's why, when these DNA/RNA problems come up, I ask for and try to extract very simple descriptions of the problems that aren't couched in either biogenetic terminology or the mathematical symbolism applicable to just one approach to solving the problem. Unfortunately, these requests usually fall on deaf ears.


      I am just a normal user of the site. I only use it occasionally, as training for my coding skills, so no offence taken.

      Well, if insertion and deletion are also involved in the edit distance, a simple XOR cannot solve the problem. The graph solution to RNAS would be terribly slow (not only in Perl) and one would run out of time (there is a 5-minute limit). You have to find a way to solve the problem without representing the graphs at all. That was what took me some time; once I knew the algorithm, the implementation was routine.
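
      To illustrate the point about insertions and deletions, here is a rough sketch of the textbook Wagner-Fischer dynamic programme next to the XOR count. The sub name edit_distance and the sample strings are hypothetical, chosen only for the example; it is not meant as the solution the site expects.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance via the Wagner-Fischer dynamic programme,
# keeping only two rows of the table in memory at a time.
# (edit_distance is a hypothetical helper name used for illustration.)
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));     # distance from '' to each prefix of $t
    for my $i (1 .. length($s)) {
        my @cur = ($i);               # distance from a prefix of $s to ''
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            $cur[$j] = min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # match or substitution
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# A one-base shift: every position mismatches, yet two edits suffice.
my $x       = 'ACGT' ^ 'CGTA';
my $hamming = length($x) - ($x =~ tr/\0//);
my $edit    = edit_distance('ACGT', 'CGTA');
print "mismatches by XOR: $hamming, edit distance: $edit\n";   # 4 and 2
```

      The XOR count only sees substitutions at fixed positions, so a single shifted base makes every position look different, while the dynamic programme finds the delete-then-insert path.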

      It is true that most of the simpler problems can be solved by one-liners in Perl. The site does not care about how you solve the problems: you just download the test data and have 5 minutes to upload the solution. That gives us Perlers an easier start on the ladder.

      I agree with you that the jargon used by some bioinformaticians is incomprehensible. I fear the reason is sometimes that they do not understand the underlying problems in proper detail, and the jargon is just a form of handwaving.

      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^4: genetic algorithm for motif finding
by Anonymous Monk on Aug 14, 2013 at 02:20 UTC

    You can learn a bit of bioinformatics at Rosalind. By solving problems, you not only gain XP and compete with others, but also learn about the underlying science.

    You can learn half jargon here

    Seriously, the problem with bioinformatics is that they want to turn every programmer into a bioinformatician -- it's like wanting to turn every electrician into an electrical engineer

      Not sure what you mean. The point isn't to have every programmer be a bioinformatician. Those programmers who are interested can choose to look further into the application of programming, etc. to biology. The number of posts on PerlMonks by bioinformaticists is due to the long legacy of Perl in Systems Administration and Bioinformatics (where SAs were the first bioinformaticists for the most part).

      Bioinformatics

        ... Not sure what you mean...

        humor doesn'T penetrATe?:)

        In The hisTory of perlmonks, well over 95% bio...quesTions do noT ATTempT To eliminATe bio..jArGon

        so when The quesTion CAn explAin wiThouT jArGon? reCeives A (well inTenTioned) response of you CAn leArn some jArGon The hArd wAy , A smAll joke spAwns pokes iTs TenTACle