G'day Pete,

[I was going to comment on your code, but that's already been done. I'll just say I concur and encourage you to read and understand the excellent advice from ++Discipulus.]

Biological data is typically huge and you need to consider this when dealing with it. Avoid multiple loops. Don't automatically reach for a regex. Consider Perl's string handling functions, such as index and substr, which I demonstrate below: these are (almost) always faster than their regex equivalents.
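
As a minimal sketch of the index/substr approach (the two helper names here are hypothetical, purely for illustration), both subs below pull out the text between the leading ">" and the first "|" of a header line. They return the same value; the first does it without invoking the regex engine. I haven't included timings: if you want numbers for your own data, the core Benchmark module is one way to get them.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper: header name via index and substr.
# Returns nothing (undef in scalar context) if the line has no '|'.
sub name_by_index {
    my ($line) = @_;
    my $pos = index $line, '|';
    return if $pos == -1;
    return substr $line, 1, $pos - 1;   # skip the leading '>'
}

# Hypothetical helper: the same extraction with a regex, for comparison.
sub name_by_regex {
    my ($line) = @_;
    return $line =~ /\A>([^|]*)\|/ ? $1 : undef;
}

my $header = ">hsa_circ_0000001|111|222|999\n";

print name_by_index($header), "\n";   # hsa_circ_0000001
print name_by_regex($header), "\n";   # hsa_circ_0000001
```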

One piece of information you omitted was how many unique "hsa_circ_0000001"-type elements you have: this equates to how many files you'll need to create. In the code below, I've assumed that the number is small enough that you could have them all open at once. If this isn't the case, you can still use much the same technique, but you'll need to keep some sort of record of file usage: keep open the files you write to often; close the least-used ones and reopen them as required (in append mode, so that earlier output isn't overwritten).
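
If you do hit a limit on open files, the record of file usage might look something like the following. This is only a sketch under my own assumptions: the limit ($MAX_OPEN), the sub names, the filenames and the temporary output directory are all invented for the demonstration; and it evicts the least recently used handle, which is a simple stand-in for "least used". A file is opened with '>' the first time and with '>>' whenever it has to be reopened.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

use File::Spec;
use File::Temp 'tempdir';

# Assumptions for this sketch: a tiny limit and a throwaway output directory.
my $MAX_OPEN = 2;
my $out_dir  = tempdir(CLEANUP => 1);

my (%fh_for, %last_used, %opened_before);
my $tick = 0;

# Hypothetical filename generator.
sub limited_fname {
    return File::Spec->catfile($out_dir, "output_$_[0].fasta");
}

sub get_limited_fh {
    my ($name) = @_;

    unless (exists $fh_for{$name}) {
        # At the limit: close the handle that was used longest ago.
        if (keys %fh_for >= $MAX_OPEN) {
            my ($oldest) = sort { $last_used{$a} <=> $last_used{$b} } keys %fh_for;
            my $old_fh = delete $fh_for{$oldest};
            close $old_fh;
            delete $last_used{$oldest};
        }

        # Truncate on first open; append when reopening after an eviction.
        my $mode = $opened_before{$name}++ ? '>>' : '>';
        open $fh_for{$name}, $mode, limited_fname($name);
    }

    $last_used{$name} = ++$tick;
    return $fh_for{$name};
}

sub close_limited_fhs {
    close $_ for values %fh_for;
    %fh_for    = ();
    %last_used = ();
}

# Three names with a limit of two: 'qwerty' is evicted, then reopened to append.
for my $name (qw{qwerty asdfgh zxcvbn qwerty}) {
    print { get_limited_fh($name) } ">$name|demo\n";
}

close_limited_fhs();
```

Sorting the keys on every eviction is fine for a handful of handles; with a large limit you'd probably want a cheaper way to find the oldest one.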

I dummied up some input data:

$ cat pm_1193237_input.fasta
>qwerty|111|222|999
AAAAAAAAAA
>asdfgh|333|444|888
CCCCCCCCCC
>zxcvbn|555|666|777
GGGGGGGGGG
>plokmi|777|888|666
TTTTTTTTTT
>qwerty|111|222|555
AAAAAAAAAA
>asdfgh|333|444|444
CCCCCCCCCC
>zxcvbn|555|666|333
GGGGGGGGGG
>plokmi|777|888|222
TTTTTTTTTT

Then ran this script:

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my $infile = 'pm_1193237_input.fasta';

{
    my $out_fh;

    open my $in_fh, '<', $infile;

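    while (<$in_fh>) {
        # Headers contain '|'; sequence lines don't.
        my $pos = index $_, '|';

        if ($pos == -1) {   # Sequence: goes to the file of the last header seen
            print $out_fh $_;
        }
        else {              # Header: name lies between '>' and the first '|'
            print { get_fh(\$out_fh, substr $_, 1, $pos - 1) } $_;
        }
    }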

    close $in_fh;
}

close_out_fhs();

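{
    # Cache of output filehandles, keyed by header name; private to these subs.
    my %fh_for;

    sub get_fh {
        my ($fh, $name) = @_;

        # Open each output file once, on first use.
        unless (exists $fh_for{$name}) {
            open $fh_for{$name}, '>', gen_fname($name);
        }

        # Remember the current handle for the sequence lines that follow;
        # the value of this assignment is also the sub's return value.
        $$fh = $fh_for{$name};
    }

    sub close_out_fhs { close $_ for values %fh_for }
}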

sub gen_fname { 'pm_1193237_output_' . $_[0] . '.fasta' }

Which produced these files:

$ cat pm_1193237_output_qwerty.fasta
>qwerty|111|222|999
AAAAAAAAAA
>qwerty|111|222|555
AAAAAAAAAA
$ cat pm_1193237_output_asdfgh.fasta
>asdfgh|333|444|888
CCCCCCCCCC
>asdfgh|333|444|444
CCCCCCCCCC
$ cat pm_1193237_output_zxcvbn.fasta
>zxcvbn|555|666|777
GGGGGGGGGG
>zxcvbn|555|666|333
GGGGGGGGGG
$ cat pm_1193237_output_plokmi.fasta
>plokmi|777|888|666
TTTTTTTTTT
>plokmi|777|888|222
TTTTTTTTTT

— Ken

In reply to Re: An overlapping regex capture by kcott
in thread An overlapping regex capture by Peter Keystrokes
