G'day Pete,

[I was going to comment on your code, but that's already been done. I'll just say I concur and encourage you to read and understand the excellent advice from ++Discipulus.]

Biological data is typically huge, so the cost of every pass over it, and of every operation per line, adds up quickly. Avoid multiple loops. Don't automatically reach for a regex. Consider Perl's string handling functions, such as index and substr, which I demonstrate below: these are (almost) always faster than their regex equivalents.
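To illustrate the index/substr point in isolation, here is a minimal sketch that pulls the ID out of a header line both ways. The sub names id_by_index and id_by_regex are made up for this example, and it assumes header lines start with '>' and use '|' as the field separator, as in the sample data further down.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helpers for illustration only.
# Both return the text between the leading '>' and the first '|',
# or undef when the line contains no '|' (i.e. a sequence line).

sub id_by_index {
    my ($line) = @_;

    my $pos = index $line, '|';
    return undef if $pos == -1;

    # Skip the leading '>' (offset 1); take everything up to the '|'.
    # Assumes the line really does start with '>'.
    return substr($line, 1, $pos - 1);
}

sub id_by_regex {
    my ($line) = @_;

    return $line =~ /^>([^|]*)\|/ ? $1 : undef;
}

for my $line ('>qwerty|111|222|999', 'AAAAAAAAAA') {
    my @ids = map { defined $_ ? $_ : '(none)' }
        id_by_index($line), id_by_regex($line);

    print "$line: index/substr=$ids[0] regex=$ids[1]\n";
}
```

The two subs should agree on well-formed headers. To check the speed difference on your own data rather than take it on trust, the core Benchmark module's cmpthese function can time them side by side.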

One piece of information you omitted was how many unique "hsa_circ_0000001"-type elements you have: this will equate to how many files you'll need to create. In the code below, I've assumed that the number is small enough that you could have them all open at once. If this isn't the case, you can still use much the same technique, but you'll need to keep some sort of record of file usage: keep the files you write to often open, and close and reopen the least used ones as required.
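For the case where there are too many files to hold open at once, the record-keeping idea might look something like the sketch below. It is an assumption-laden illustration, not tested against real data: the names $MAX_OPEN, get_cached_fh, close_cached_fhs and gen_cached_fname, the limit of 2, and the output filename prefix are all invented for the example. It closes the least recently used handle when the limit is reached, and reopens files in append mode so that earlier output isn't overwritten.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

# Hypothetical limit, kept tiny for demonstration.
# In practice, choose something below your system's open-file limit.
my $MAX_OPEN = 2;

{
    my %fh_for;       # name => currently open filehandle
    my %last_used;    # name => counter value at most recent use
    my %created;      # name => true once the file has been opened
    my $tick = 0;

    sub get_cached_fh {
        my ($name) = @_;

        unless (exists $fh_for{$name}) {
            # At the limit: close the least recently used handle.
            if (scalar(keys %fh_for) >= $MAX_OPEN) {
                my ($oldest) = sort { $last_used{$a} <=> $last_used{$b} }
                    keys %fh_for;
                my $old_fh = delete $fh_for{$oldest};
                delete $last_used{$oldest};
                close $old_fh;
            }

            # Truncate on first open; append on every reopen.
            my $mode = $created{$name}++ ? '>>' : '>';
            open $fh_for{$name}, $mode, gen_cached_fname($name);
        }

        $last_used{$name} = ++$tick;
        return $fh_for{$name};
    }

    sub close_cached_fhs {
        close $_ for values %fh_for;
        %fh_for    = ();
        %last_used = ();
    }
}

# Hypothetical filename scheme for this sketch.
sub gen_cached_fname { 'pm_1193237_cached_' . $_[0] . '.fasta' }
```

A caller would use `print { get_cached_fh($name) } $line;` for each line and call close_cached_fhs() at the end. Sorting the keys on every eviction is fine for a small limit; with a large one, a smarter structure would be worth considering, as would the core FileCache module, which addresses the same problem.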

I dummied up some input data:

$ cat pm_1193237_input.fasta
>qwerty|111|222|999
AAAAAAAAAA
>asdfgh|333|444|888
CCCCCCCCCC
>zxcvbn|555|666|777
GGGGGGGGGG
>plokmi|777|888|666
TTTTTTTTTT
>qwerty|111|222|555
AAAAAAAAAA
>asdfgh|333|444|444
CCCCCCCCCC
>zxcvbn|555|666|333
GGGGGGGGGG
>plokmi|777|888|222
TTTTTTTTTT

Then ran this script:

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my $infile = 'pm_1193237_input.fasta';

{
    my $out_fh;
    open my $in_fh, '<', $infile;

    while (<$in_fh>) {
        my $pos = index $_, '|';

        if ($pos == -1) {
            # Sequence
            print $out_fh $_;
        }
        else {
            # Header
            print { get_fh(\$out_fh, substr $_, 1, $pos - 1) } $_;
        }
    }

    close $in_fh;
}

close_out_fhs();

{
    my %fh_for;

    sub get_fh {
        my ($fh, $name) = @_;

        unless (exists $fh_for{$name}) {
            open $fh_for{$name}, '>', gen_fname($name);
        }

        $$fh = $fh_for{$name};
    }

    sub close_out_fhs { close $_ for values %fh_for }
}

sub gen_fname { 'pm_1193237_output_' . $_[0] . '.fasta' }

Which produced these files:

$ cat pm_1193237_output_qwerty.fasta
>qwerty|111|222|999
AAAAAAAAAA
>qwerty|111|222|555
AAAAAAAAAA
$ cat pm_1193237_output_asdfgh.fasta
>asdfgh|333|444|888
CCCCCCCCCC
>asdfgh|333|444|444
CCCCCCCCCC
$ cat pm_1193237_output_zxcvbn.fasta
>zxcvbn|555|666|777
GGGGGGGGGG
>zxcvbn|555|666|333
GGGGGGGGGG
$ cat pm_1193237_output_plokmi.fasta
>plokmi|777|888|666
TTTTTTTTTT
>plokmi|777|888|222
TTTTTTTTTT

— Ken


In reply to Re: An overlapping regex capture by kcott
in thread An overlapping regex capture by Peter Keystrokes
