in reply to An overlapping regex capture

G'day Pete,

[I was going to comment on your code, but that's already been done. I'll just say I concur and encourage you to read and understand the excellent advice from ++Discipulus.]

Biological data is typically huge and you need to consider this when dealing with it. Avoid multiple loops. Don't automatically reach for a regex. Consider Perl's string handling functions, such as index and substr, which I demonstrate below: these are (almost) always faster than their regex equivalents.
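As a small illustration of the index/substr point, here is a minimal sketch that pulls the ID out of a header line both ways. The sub names id_by_substr and id_by_regex are made up for this example, and the header is in the shape of the dummy data used in this post rather than real data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical header line, shaped like the dummy data in this post.
my $header = ">qwerty|111|222|999\n";

# index/substr: locate the first '|' and take what lies between '>' and it.
# Note: this assumes the line starts with '>' and does not check for it.
sub id_by_substr {
    my ($line) = @_;
    my $pos = index $line, '|';
    return if $pos == -1;               # no '|': not a header line
    return substr $line, 1, $pos - 1;
}

# Regex equivalent, shown for comparison only.
sub id_by_regex {
    my ($line) = @_;
    return $line =~ /\A>([^|]*)\|/ ? $1 : undef;
}

print id_by_substr($header), "\n";      # prints "qwerty"
print id_by_regex($header),  "\n";      # prints "qwerty"
```

If you want to check the speed difference on your own data, the core Benchmark module's cmpthese function can time the two subs against each other.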

One piece of information you omitted was how many unique "hsa_circ_0000001"-type elements you have: this equates to how many files you'll need to create. In the code below, I've assumed that the number is small enough that you could have them all open at once. If this isn't the case, you can still use much the same technique, but you'll need to keep some sort of record of file usage: keep the files you write to often open, and close and reopen the least-used ones as required.

I dummied up some input data:

$ cat pm_1193237_input.fasta
>qwerty|111|222|999
AAAAAAAAAA
>asdfgh|333|444|888
CCCCCCCCCC
>zxcvbn|555|666|777
GGGGGGGGGG
>plokmi|777|888|666
TTTTTTTTTT
>qwerty|111|222|555
AAAAAAAAAA
>asdfgh|333|444|444
CCCCCCCCCC
>zxcvbn|555|666|333
GGGGGGGGGG
>plokmi|777|888|222
TTTTTTTTTT

Then ran this script:

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my $infile = 'pm_1193237_input.fasta';

{
    my $out_fh;
    open my $in_fh, '<', $infile;

    while (<$in_fh>) {
        my $pos = index $_, '|';
        if ($pos == -1) {
            # Sequence
            print $out_fh $_;
        }
        else {
            # Header
            print { get_fh(\$out_fh, substr $_, 1, $pos - 1) } $_;
        }
    }

    close $in_fh;
}

close_out_fhs();

{
    my %fh_for;

    sub get_fh {
        my ($fh, $name) = @_;

        unless (exists $fh_for{$name}) {
            open $fh_for{$name}, '>', gen_fname($name);
        }

        $$fh = $fh_for{$name};
    }

    sub close_out_fhs { close $_ for values %fh_for }
}

sub gen_fname { 'pm_1193237_output_' . $_[0] . '.fasta' }

Which produced these files:

$ cat pm_1193237_output_qwerty.fasta
>qwerty|111|222|999
AAAAAAAAAA
>qwerty|111|222|555
AAAAAAAAAA
$ cat pm_1193237_output_asdfgh.fasta
>asdfgh|333|444|888
CCCCCCCCCC
>asdfgh|333|444|444
CCCCCCCCCC
$ cat pm_1193237_output_zxcvbn.fasta
>zxcvbn|555|666|777
GGGGGGGGGG
>zxcvbn|555|666|333
GGGGGGGGGG
$ cat pm_1193237_output_plokmi.fasta
>plokmi|777|888|666
TTTTTTTTTT
>plokmi|777|888|222
TTTTTTTTTT
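If there are too many distinct IDs to hold every output file open at once, the record of file usage mentioned earlier might look something like the following. Treat it as a sketch under assumptions, not tested against real data: get_capped_fh, close_capped_fhs and $max_open are names invented for illustration, the cap of 2 is deliberately tiny so that the dummy records force evictions, and output goes to a temporary directory so the demo cleans up after itself.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use File::Temp qw(tempdir);

# This block must run before the subs are first called,
# otherwise $max_open would still be undefined.
{
    my $max_open = 2;   # assumed cap; set it below your system's open-file limit
    my $tick     = 0;   # logical clock recording when each file was last used
    my (%fh_for, %last_used, %seen);

    sub get_capped_fh {
        my ($path) = @_;
        $last_used{$path} = ++$tick;
        return $fh_for{$path} if exists $fh_for{$path};

        if (scalar(keys %fh_for) >= $max_open) {
            # Evict the least recently used handle. Sorting is fine for a
            # small cap; a large cap would want a cheaper structure.
            my ($oldest) = sort { $last_used{$a} <=> $last_used{$b} } keys %fh_for;
            my $old_fh = delete $fh_for{$oldest};
            close $old_fh;
        }

        # First open truncates; any reopen appends so earlier records survive.
        my $mode = $seen{$path}++ ? '>>' : '>';
        open $fh_for{$path}, $mode, $path;
        return $fh_for{$path};
    }

    sub close_capped_fhs {
        close $_ for values %fh_for;
        %fh_for = ();
    }
}

my $dir = tempdir(CLEANUP => 1);

# Same dummy records as the input file above, held in memory for the demo.
my @lines = map { "$_\n" } (
    '>qwerty|111|222|999', 'AAAAAAAAAA', '>asdfgh|333|444|888', 'CCCCCCCCCC',
    '>zxcvbn|555|666|777', 'GGGGGGGGGG', '>plokmi|777|888|666', 'TTTTTTTTTT',
    '>qwerty|111|222|555', 'AAAAAAAAAA', '>asdfgh|333|444|444', 'CCCCCCCCCC',
    '>zxcvbn|555|666|333', 'GGGGGGGGGG', '>plokmi|777|888|222', 'TTTTTTTTTT',
);

my $out_fh;
for my $line (@lines) {
    my $pos = index $line, '|';
    if ($pos != -1) {
        # Header: switch to (or reopen) the file for this ID.
        my $name = substr $line, 1, $pos - 1;
        $out_fh = get_capped_fh("$dir/pm_1193237_output_$name.fasta");
    }
    print $out_fh $line;    # assumes the data starts with a header line
}

close_capped_fhs();
```

With four IDs cycling through a cap of two, every file is closed and reopened in append mode at least once, so each should end up with the same content as in the all-open version. Evicting by last use is just one choice; counting writes per file and evicting the least-written would match "least used" more literally.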

— Ken

Re^2: An overlapping regex capture
by Peter Keystrokes (Beadle) on Jun 22, 2017 at 09:50 UTC
    Thank you for the advice; I appreciate it.

    The file I am dealing with contains ~140k sequences, so I've created a dummy file containing about 20 sequences to test the script on.

    Pete.