Re: An overlapping regex capture

G'day Pete,

[I was going to comment on your code, but that's already been done. I'll just say I concur and encourage you to read and understand the excellent advice from ++Discipulus.]

Biological data is typically huge and you need to consider this when dealing with it. Avoid multiple loops. Don't automatically reach for a regex. Consider Perl's string handling functions, such as index and substr, which I demonstrate below: these are (almost) always faster than their regex equivalents.
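As a small, standalone illustration of that point, here is a sketch that pulls the ID out of a header line with index and substr, next to a regex that does the same job. The sub names id_by_index and id_by_regex are made up for this example, and the header layout (">id|field|field|...") is assumed from the dummy data further down.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Extract the ID from a header of the assumed form ">id|field|field|...".
# Returns nothing (undef in scalar context) for non-header lines
# and for headers that contain no '|'.
sub id_by_index {
    my ($line) = @_;
    return unless substr($line, 0, 1) eq '>';
    my $pos = index $line, '|';
    return if $pos == -1;
    return substr $line, 1, $pos - 1;
}

# Regex equivalent, shown only for comparison.
sub id_by_regex {
    my ($line) = @_;
    return $line =~ /\A>([^|]*)\|/ ? $1 : undef;
}

for my $line (">qwerty|111|222|999\n", "AAAAAAAAAA\n") {
    my $id = id_by_index($line);
    print defined $id ? "header, ID: $id\n" : "sequence line\n";
}
```

If you want to check the speed difference on your own data rather than take my word for it, the core Benchmark module's cmpthese function can compare the two subs directly.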

One piece of information you omitted was how many unique "hsa_circ_0000001"-type elements you have: this will equate to how many files you'll need to create. In the code below, I've assumed that the number is small enough that you could have them all open at once. If this isn't the case, you can still use much the same technique, but you'll need to keep some sort of record of file usage: keep open the files you write to often, and close and reopen the least-used ones as required.
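For the case where you can't keep everything open, here is one possible sketch of that record-keeping: a small cache that closes the least recently used handle when a limit is reached, and reopens files in append mode so earlier output isn't lost. Everything here is an assumption for illustration: the limit of 2, the sub names (get_cached_fh, close_cached_fhs, open_count, gen_cached_fname) and the temporary output directory are not part of the script below.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use File::Spec;
use File::Temp qw(tempdir);

my $MAX_OPEN = 2;                        # hypothetical limit, tiny for the demo
my $OUT_DIR  = tempdir(CLEANUP => 1);    # demo output goes to a temp directory

{
    my %fh_for;       # name => currently open filehandle
    my %last_used;    # name => tick of the most recent request
    my %seen;         # names whose file has already been created
    my $tick = 0;

    sub get_cached_fh {
        my ($name) = @_;

        unless (exists $fh_for{$name}) {
            # At the limit: close the least recently used handle first.
            if (scalar(keys %fh_for) >= $MAX_OPEN) {
                my ($oldest) = sort { $last_used{$a} <=> $last_used{$b} }
                               keys %fh_for;
                my $old_fh = delete $fh_for{$oldest};
                delete $last_used{$oldest};
                close $old_fh;
            }

            # Truncate on first use; append when reopening.
            my $mode = $seen{$name}++ ? '>>' : '>';
            open $fh_for{$name}, $mode, gen_cached_fname($name);
        }

        $last_used{$name} = ++$tick;
        return $fh_for{$name};
    }

    sub close_cached_fhs {
        close $_ for values %fh_for;
        %fh_for    = ();
        %last_used = ();
    }

    sub open_count { scalar keys %fh_for }
}

sub gen_cached_fname {
    File::Spec->catfile($OUT_DIR, 'pm_1193237_output_' . $_[0] . '.fasta');
}

# Demo: three IDs with only two handles allowed, so 'qwerty' is
# closed to make room and later reopened in append mode.
my @records = (
    [ qwerty => ">qwerty|111|222|999\nAAAAAAAAAA\n" ],
    [ asdfgh => ">asdfgh|333|444|888\nCCCCCCCCCC\n" ],
    [ zxcvbn => ">zxcvbn|555|666|777\nGGGGGGGGGG\n" ],
    [ qwerty => ">qwerty|111|222|555\nAAAAAAAAAA\n" ],
);

print { get_cached_fh($_->[0]) } $_->[1] for @records;
close_cached_fhs();
```

Sorting on every eviction is fine for a sketch but may be worth replacing if evictions turn out to be frequent. The core FileCache module addresses the same problem and may be worth a look before rolling your own.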

I dummied up some input data:

$ cat pm_1193237_input.fasta
>qwerty|111|222|999
AAAAAAAAAA
>asdfgh|333|444|888
CCCCCCCCCC
>zxcvbn|555|666|777
GGGGGGGGGG
>plokmi|777|888|666
TTTTTTTTTT
>qwerty|111|222|555
AAAAAAAAAA
>asdfgh|333|444|444
CCCCCCCCCC
>zxcvbn|555|666|333
GGGGGGGGGG
>plokmi|777|888|222
TTTTTTTTTT

Then ran this script:

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my $infile = 'pm_1193237_input.fasta';

{
    my $out_fh;

    open my $in_fh, '<', $infile;

    while (<$in_fh>) {
        my $pos = index $_, '|';

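        if ($pos == -1) {   # Sequence: no '|', so write to the current record's file
            print $out_fh $_;
        }
        else {              # Header: the ID is the text between '>' and the first '|'
            # get_fh() also updates $out_fh, so the sequence lines that follow
            # go to the same file as this header.
            print { get_fh(\$out_fh, substr $_, 1, $pos - 1) } $_;
        }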
    }

    close $in_fh;
}

close_out_fhs();

{
    my %fh_for;

    sub get_fh {
        my ($fh, $name) = @_;

        unless (exists $fh_for{$name}) {
            open $fh_for{$name}, '>', gen_fname($name);
        }

        $$fh = $fh_for{$name};
    }

    sub close_out_fhs { close $_ for values %fh_for }
}

sub gen_fname { 'pm_1193237_output_' . $_[0] . '.fasta' }

Which produced these files:

$ cat pm_1193237_output_qwerty.fasta
>qwerty|111|222|999
AAAAAAAAAA
>qwerty|111|222|555
AAAAAAAAAA
$ cat pm_1193237_output_asdfgh.fasta
>asdfgh|333|444|888
CCCCCCCCCC
>asdfgh|333|444|444
CCCCCCCCCC
$ cat pm_1193237_output_zxcvbn.fasta
>zxcvbn|555|666|777
GGGGGGGGGG
>zxcvbn|555|666|333
GGGGGGGGGG
$ cat pm_1193237_output_plokmi.fasta
>plokmi|777|888|666
TTTTTTTTTT
>plokmi|777|888|222
TTTTTTTTTT

— Ken

Re^2: An overlapping regex capture by Peter Keystrokes (Beadle) on Jun 22, 2017 at 09:50 UTC
Thank you for the advice; I appreciate it. The file I am dealing with contains ~140k sequences, so I've created a dummy file containing about 20 sequences to test the script on. Pete.