baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am looking for a Perl environment for building pipelines. An environment that will allow me to easily document my pipelines (something like doxygen, to create both LaTeX and HTML documentation), that will allow me to easily and automatically split the input and fork the process (basic parallelization), that will let me combine tools in different ways, that will not bug me with the environment setup, and that can be run in a pre-set environment, both as a user and as a superuser...

What I started to build is a system that essentially creates a set of modules for each tool utilized, which eliminates a lot of background work and allows me to hard-code some of the tools. But given the amount of work and the many pitfalls of my system, I was wondering whether someone has already tried to do something like this.

thank you

baxy

PS

A similar but far too fancy, non-CLI-based and complicated (not flexible in any aspect) system is Galaxy.
Any advice ?

Re: Perl pipeline builder
by Corion (Patriarch) on Feb 08, 2018 at 11:59 UTC

    I assume that by "pipeline" you mean a sequence of multiple transformations that you want to apply to your data. Ideally, you want to be able to run parts of these transformations in parallel.

    I see various degrees of hairiness that you can apply here.

    The easiest approach, if your data model fits it, is to use make and a Makefile. This buys you easy/trivial restartability and trivial parallelization. On the downside, all your data must reside in files, and the rules to get from one set of files to another (set of) file(s) are fairly restricted. Especially, I think, because make can only have one output file for a set of input files and not a set of output files. One aspect of make is that it requires you to think backwards from the result you want, through the intermediate results, until you reach a point you can start from.
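
    The core idea behind make (rebuild a target only when it is missing or older than its inputs) can also be sketched in plain Perl with the -M file-age test; the step commands and file names below (trim_tool, align_tool, *.fastq) are made up purely for illustration:

        use strict;
        use warnings;

        # Rebuild $out from $in only if $out is missing or older than $in --
        # the same freshness rule make applies to its targets.
        sub run_if_stale {
            my ($in, $out, $cmd) = @_;
            if (!-e $out or -M $out > -M $in) {
                system($cmd) == 0 or die "step failed: $cmd\n";
            }
        }

        run_if_stale('input.fastq',   'trimmed.fastq',
            'trim_tool input.fastq > trimmed.fastq');
        run_if_stale('trimmed.fastq', 'aligned.bam',
            'align_tool trimmed.fastq > aligned.bam');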

    The next best approach is any of the various job queues into which you stuff the steps of your pipelines. I know of Minion, Queue::Dir and Directory::Queue and various modules in the Job namespace. These all model each job (step) as a separate program or module. The advantage is that you get fairly simple restartability and parallelization/distribution, even across machines. The downside is that you will have to adapt your existing programs/modules to whatever mechanism you choose and think about whether you will push all steps for a given job into the job queue at once or whether each job should know about the next step that should be taken after it has finished. The approach of a job queue is more forward-thinking, as you know what you start out with and likely already know the next step to be taken in each case.
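
    A minimal sketch of the job-queue idea using Minion (assuming a Mojolicious::Lite script with the SQLite backend, Minion::Backend::SQLite; the task names, commands and file names are invented for illustration):

        use Mojolicious::Lite;

        plugin Minion => { SQLite => 'sqlite:pipeline.db' };

        # One task per pipeline step. Minion stores the job, can retry it
        # on failure, and lets several workers (even on other machines
        # sharing the backend) pick up jobs in parallel.
        app->minion->add_task(trim => sub {
            my ($job, $in, $out) = @_;
            system("trim_tool $in > $out") == 0 or die "trim failed\n";
            # Here each step enqueues the next one after it has finished.
            $job->minion->enqueue(align => [$out, "$out.bam"]);
        });
        app->minion->add_task(align => sub {
            my ($job, $in, $out) = @_;
            system("align_tool $in > $out") == 0 or die "align failed\n";
        });

        # Enqueue the first step, then run "perl pipeline.pl minion worker"
        # (possibly several of them) to actually process the queue.
        app->minion->enqueue(trim => ['input.fastq', 'trimmed.fastq']);
        app->start;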

    Depending on how far you can/want/need to go when implementing your jobs, maybe Workflow is something that you can use. This could allow you to organise the sequence of steps in a central location. Also, Workflow (or anything like it) allows you to have loops and retries, things that are hard to model in make.
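
    For completeness, a rough sketch of how the Workflow distribution is wired up (the XML file names and the 'pipeline' / 'run_step' identifiers are placeholders; the real configuration files define the states, actions and conditions):

        use strict;
        use warnings;
        use Workflow::Factory qw( FACTORY );

        # The factory is configured from XML files describing the states,
        # the actions that move between them, and optional conditions.
        FACTORY->add_config_from_file(
            workflow  => 'workflow.xml',
            action    => 'action.xml',
            condition => 'condition.xml',
        );

        my $wf = FACTORY->create_workflow('pipeline');
        print "Current state: ", $wf->state, "\n";

        # Executing an action moves the workflow into its next state;
        # loops and retries live in the workflow definition itself.
        $wf->execute_action('run_step');
        print "New state: ", $wf->state, "\n";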

    Personally, I have only used/written hard-coded sequences and nothing that was nice to generate configuration from. The most sophisticated (and never used) idea I toyed with was something like make that could also look at SQL tables and run SQL statements to see whether rules are satisfied, but it quickly grew too complex.

Re: Perl pipeline builder
by jahero (Pilgrim) on Feb 08, 2018 at 12:27 UTC