Google Summer of Code participants sought

The Phyloinformatics group of the National Evolutionary Synthesis Center (in Durham, NC, USA) is seeking students - competent and creative Perl programmers, not limited to the US - to participate in this year's Google Summer of Code. The following section contains the standard invitation:

Phyloinformatics Summer of Code 2007

A collaborative Phyloinformatics Group, sponsored by the National Evolutionary Synthesis Center, is working to develop user interfaces, improve software interoperability and support data exchange standards in evolutionary bioinformatics. The specific projects are diverse in nature: they range from developing AJAX components for web-based bioinformatics applications, through managing workflows using approaches from functional and logic programming, to developing data exchange standards for phylogenetic substitution models.

The Phyloinformatics group will be sponsoring student collaborators through the Google Summer of Code program, which provides undergraduate, master's and PhD students with a unique opportunity (over three summer months) to obtain hands-on experience writing and extending open-source software under the mentorship of experienced developers from around the world. We are particularly targeting students interested in both evolutionary biology and software development. Students will have one or more dedicated mentors with expertise in phylogenetic methods and open-source software development. Our project proposals are flexible and can be adjusted in scope to match the skills of students with less programming proficiency. If the program sounds interesting to you but you are unsure whether you have the necessary skills, please email the mentors at phylosoc {at} nescent {dot} org (note: or just contact me through PM). We will work with anyone who is genuinely interested to find a project that fits their interests and skills. Students will receive a stipend from Google and will be invited to participate in future collaborative events such as the NESCent Phyloinformatics Hackathons.

TO APPLY: Students must apply online at the Google Summer of Code website. The application period for students is now open and ends on Saturday, March 24, 2007 (one week from now).

The Phyloinformatics Summer of Code project and ideas page is at the following URL: http://phyloinformatics.net/Phyloinformatics_Summer_of_Code_2007

The above page also contains links to the GSoC program rules, eligibility requirements, and stipend payment mechanism. We encourage all interested students to email any questions, or self-proposed project ideas, to phylosoc {at} nescent {dot} org. This will reach all prospective mentors.

To give you a more specific example of what you might be doing, the next section outlines the project I will be mentoring:

Introduction

Phylogenetic analysis is the branch of evolutionary biology that deals with reconstructing the Tree of Life. Through comparative analysis of the pattern of differences and similarities between extant species - at the molecular, morphological and behavioural levels - phylogeneticists seek to infer the patterns of relationships between species. In recent years, with the advent of cheap and fast DNA analysis techniques, phylogenetics has grown tremendously in size, scope and sophistication - and a new field has emerged: that of phyloinformatics. Whereas ten years ago phylogenetic analyses would be performed on sets of perhaps a dozen species, we now seek to do the same for hundreds or thousands of species. This poses problems in terms of data management, data standards, and so on.

The problem

In the 1980s, when some of the techniques we use were first developed, managing phylogenetic data was something a dedicated graduate student could do by hand. Input files for analyses would be copied and pasted together, fed into the analysis software one at a time, and the inevitable typos would then be hunted down. In that climate, a flat file format (called '#NEXUS') was designed, and was soon widely adopted by phylogenetic software developers. Unfortunately, this format is now collapsing under its own success: several different and incompatible "dialects" have emerged over time, and the lack of a formal grammar or other means of validation has caused problems that become increasingly difficult to fix. We clearly need a new approach, such that data can be validated and more reliably exchanged between applications.

NeXML

Several XML formats have been proposed as a means to represent evolutionary trees and comparative data, but most of these proposals were never adopted. One reason is that the formats were presented without reference to the commonly used software toolkits. A new format is all well and good, but without a proof of concept showing efficient serialization between XML and, say, Bio::Phylo, such a proposal is of little immediate use.

What you can do

Our invitation to potential applicants is to design an XML format that maps efficiently onto the object model of any one of the commonly used Perl toolkits for phyloinformatics (BioPerl, Bio::Phylo, Bio::NEXUS). In particular, the requirements are as follows:

The format can express the fundamental entities common to all aforementioned toolkits: comparative data (e.g. DNA sequences) and evolutionary trees.
The format can be validated using a DTD or XML Schema.
The format allows for generic annotations (e.g. key/value pairs) to be attached to the data objects.
The format is optimized for minimal memory requirements, both in file size and in the demands it places on parsers. For example, element nesting should be shallow, and trees should be represented in an ordered way (e.g. node elements in depth-first order) to minimize the number of back references that need to be maintained during parsing.
The format uses UIDs and ID references to maintain referential integrity between entities.
Successful completion of this project includes the implementation of a parser/file writer that serializes the objects of any one of the aforementioned toolkits (preferably Bio::Phylo, which has an adaptor architecture to pass objects on to the other toolkits).
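
To make the requirements above a little more concrete, here is a minimal sketch of what a writer for such a format might look like. It is written in Python purely for illustration (the project itself targets the Perl toolkits), and every element and attribute name in it (tree, node, meta, id, parent, key, value) is a made-up placeholder, not a proposed standard and not anything Bio::Phylo defines. It assumes nodes are handed over in depth-first order, so a parent ID can always be checked against IDs already written.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory tree: (node id, parent id or None, annotations).
# Listed depth-first, so every parent appears before its children.
NODES = [
    ("n1", None, {"label": "root"}),
    ("n2", "n1", {"label": "Hominidae"}),
    ("n3", "n2", {"label": "Homo sapiens", "support": "0.98"}),
    ("n4", "n2", {"label": "Pan troglodytes"}),
    ("n5", "n1", {"label": "Macaca mulatta"}),
]


def write_tree(nodes, tree_id="t1"):
    """Serialize nodes as flat <node> elements (illustrative names only)."""
    tree = ET.Element("tree", {"id": tree_id})
    seen = set()
    for node_id, parent_id, annotations in nodes:
        # Referential integrity: IDs are unique, parents are already written.
        if node_id in seen:
            raise ValueError("duplicate node id: %s" % node_id)
        if parent_id is not None and parent_id not in seen:
            raise ValueError("unknown parent id: %s" % parent_id)
        attrs = {"id": node_id}
        if parent_id is not None:
            attrs["parent"] = parent_id
        node = ET.SubElement(tree, "node", attrs)
        # Generic key/value annotations attached to the data object.
        for key, value in sorted(annotations.items()):
            ET.SubElement(node, "meta", {"key": key, "value": value})
        seen.add(node_id)
    return ET.tostring(tree, encoding="unicode")


xml_text = write_tree(NODES)
print(xml_text)
```

Note that nesting never goes deeper than tree/node/meta regardless of how deep the phylogeny is, which is the point of the shallow-nesting requirement. Validation against a DTD or XML Schema is not shown here.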

What you'd be doing in practice

Study the toolkit APIs and find out what sort of metadata is being attached to the fundamental objects (of course we'll explain what's going on).
Sketch out the XML format (this always seems to be an iterative process).
Create some huge instances of this format (we have reference data sets).
Show how quickly and efficiently your format can be parsed. Here your design will have to strike a balance between rich annotation and efficient memory usage. To give an example: the Tree of Life website has an XML service where parts of the tree can be retrieved based on the ID of the requested split in the tree. The root of the tree has ID=1, so requesting it returns the entire tree: watch out! The principal issue with this design is that the tree shape is represented using (deeply) nested elements, which requires the parser to hold huge chunks in memory. A better solution would be to have the nodes in the tree organized along the lines of <branch><child id="something" /><parent idref="something_else" /></branch>.
Stay in touch through email/teleconferencing/IM/IRC. Commit code. GSoC is not bound to any one location - I am in Berlin, you could be elsewhere.
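
As a rough illustration of the parsing step, the sketch below streams a flat list of <branch> records of the kind suggested above and keeps only a child-to-parent lookup table in memory. Again this is Python for illustration only, not one of the Perl toolkits; the enclosing <tree> element, the n1, n2, ... ID scheme and the caterpillar-shaped test data are all assumptions made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def make_flat_tree(n_nodes):
    """Build a caterpillar-shaped test tree as flat <branch> records (bytes)."""
    parts = ["<tree>"]
    for i in range(2, n_nodes + 1):
        parts.append(
            '<branch><child id="n%d" /><parent idref="n%d" /></branch>'
            % (i, i - 1)
        )
    parts.append("</tree>")
    return "".join(parts).encode("utf-8")


def read_parents(stream):
    """Stream <branch> records and return a child-id -> parent-id dict."""
    parents = {}
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "branch":
            child_id = elem.find("child").get("id")
            parent_id = elem.find("parent").get("idref")
            parents[child_id] = parent_id
            # Discard the records parsed so far; only the dict is retained.
            root.clear()
    return parents


parents = read_parents(io.BytesIO(make_flat_tree(1000)))
print(len(parents), "branches read")
```

Because each record is self-contained, the parser never has to keep more than one <branch> element alive at a time. With deeply nested elements the same trick is not available, since an outer element only ends once everything inside it has been read.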

What you'd get out of it

An outlet for your creativity as a programmer
A stipend paid by Google (US$4500)
An impressive addition to your resume
Invitations to future NESCent hackathons
Experience with a dynamic, emerging field in bioinformatics
If your format is adopted by Bio::Phylo, co-authorship on a forthcoming publication that presents the API

Note that we will also consider independent proposals that address the phyloinformatics problem space, so don't hesitate to contact us. The deadline (March 24th) is approaching rapidly.

Disclaimer: I am not a recruiter or headhunter. I understand that PerlMonks is not a job board, and I apologize in advance if it looks like I am abusing PM for this purpose. But I have noticed interest in GSoC here in the past, and I am looking for good coders. I know they're here.
