Peter Keystrokes has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,

I have a script which extracts sequences from a file containing thousands of fasta sequences and creates separate files for each of them.

Here is my script:

#!/usr/bin/perl use strict; use warnings; my %id2seq = (); my $id = ''; open File,"human_hg19_circRNAs_putative_spliced_sequence.fa",or die $! +; while(<File>){ chomp; if($_ =~ /^>(.+)/){ $id = $1; }else{ $id2seq{$id} .= $_; } } foreach $id (keys %id2seq){ if (-f $id){ print $id." Already exists, about to override it","\n" } open my $out_fh, '>>', "$id.fa" or die $!; print $out_fh (">".$id."\n",$id2seq{$id}, "\n"); close $out_fh; } close File;

Now, the human_hg19_circRNAs_putative_spliced_sequence.fa file which I am working on contains sequences as such:

>hsa_circ_0000001|chr1:1080738-1080845-|None|None

ATGGGGTTGGGTCAGCCGTGCGGTCAGGTCAGGTCGGCCATGAGGTCAGGTGGGGTCGGCCATGAAGGTGGTGGGGGTCATGAGGTCACAAGGGGGTCGGCCATGTG

My script captures each sequence header as the key of a hash and captures the sequence itself as the hash. But the problem is that I want to name the files with only a part of the $id and not the whole of it i.e. hsa_circ_0000001.

Is there a simple way to do this? Or do I have to create a new hash to extract filenames?

Pete.

Replies are listed 'Best First'.
Re: An overlapping regex capture
by Discipulus (Canon) on Jun 21, 2017 at 19:27 UTC
    You can use split to split $id at every | occurence and use the first one:

    my @parts = split '|', $id; # ouch! what trivial error! is: /\|/ see c +horoba below my $filename = $parts[0]; # or simply my $filename = (split '|', $id)[0]; # ouch! what trivial error! is: / +\|/ see choroba below

    PS as general rule try to isolate your problem/question as much as you can and then choose a meningful title: An overlapping regex capture does not describe well what you are asking for.. or I'm totally missing the point..

    HtH

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      The first parameter to split is a regex, not a string (except for a space). | in a regex has a special meaning, so you need to backslash it for the literal pipe symbol:
      my @pargs = split /\|/, $id;
      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      Hmm, but how would I integrate the split function into my loop in order to name the file I create for each sequence?
        > how would I integrate the split function into my loop

        since you have two loops you can or integrate when you grab $id or just before opening the filehandle:

        But, looking more closer to your code, you have many wrong things:

        my %id2seq = (); # this is the verbose form of my %id2seq; my $id = ''; # this is the wrong place to declare this var! declar +e it when you need it ie # inside the while(<File>){ block # missing the mode: put always even if it defaults to '<' open File,"human_hg19_circRNAs_putative_spliced_sequence.fa",or die $! +; # better use lexical filehandle like in open my $fh, '<', $fi +lepath or die # bareword is still accepted but by onvention is UPPERCASE so + no open File... while(<File>){ chomp; # here you are capturing something: if you want just +the part before | you # have here the possibility to get it: /^>([\w\d]+\| +)/ as starting option? if($_ =~ /^>(.+)/){ # or here: $id = $1; # cutting $1 like in: $id = (split /\|/, $1)[0] ... # AHHH! this is error! are your use strict; use warnings ju +st make-up? # it must be foreach my $id .. # (or really does not raise a warn for the scope you given +to $id ??? if so is even # worst!!) # in short: pay attention to the scope of your variables foreach $id (keys %id2seq){ # here the last good possibility to cut $id: # $id = (split /\|/, $id)[0]; if (-f $id){ # this is a lie.. print $id." Already exists, about to override it","\n" } # .. because you are going to append, not to + overwrite open my $out_fh, '>>', "$id.fa" or die $!; # here parens are unneeded and probably nasty print $out_fh (">".$id."\n",$id2seq{$id}, "\n");

        L*

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: An overlapping regex capture
by kcott (Archbishop) on Jun 22, 2017 at 06:25 UTC

    G'day Pete,

    [I was going to comment on your code, but that's already been done. I'll just say I concur and encourage you to read and understand the excellent advice from ++Discipulus.]

    Biological data is typically huge and you need to consider this when dealing with it. Avoid multiple loops. Don't automatically reach for a regex. Consider Perl's string handling functions, such as index and substr, which I demonstrate below: these are (almost) always faster than their regex equivalents.

    One piece of information you omitted was how many uniq "hsa_circ_0000001"-type elements you have: this will equate to how many files you'll need to create. In the code below, I've assumed that the number is small enough that you could have them all open at once. If this isn't case, you can still use much the same technique, but you'll need to implement some sort of record of file usage: keeping open, files you're writing to often; closing and reopening the least used ones as required.

    I dummied up some input data:

    $ cat pm_1193237_input.fasta >qwerty|111|222|999 AAAAAAAAAA >asdfgh|333|444|888 CCCCCCCCCC >zxcvbn|555|666|777 GGGGGGGGGG >plokmi|777|888|666 TTTTTTTTTT >qwerty|111|222|555 AAAAAAAAAA >asdfgh|333|444|444 CCCCCCCCCC >zxcvbn|555|666|333 GGGGGGGGGG >plokmi|777|888|222 TTTTTTTTTT

    Then ran this script:

    #!/usr/bin/env perl use strict; use warnings; use autodie; my $infile = 'pm_1193237_input.fasta'; { my $out_fh; open my $in_fh, '<', $infile; while (<$in_fh>) { my $pos = index $_, '|'; if ($pos == -1) { # Sequence print $out_fh $_; } else { # Header print { get_fh(\$out_fh, substr $_, 1, $pos - 1) } $_; } } close $in_fh; } close_out_fhs(); { my %fh_for; sub get_fh { my ($fh, $name) = @_; unless (exists $fh_for{$name}) { open $fh_for{$name}, '>', gen_fname($name); } $$fh = $fh_for{$name}; } sub close_out_fhs { close $_ for values %fh_for } } sub gen_fname { 'pm_1193237_output_' . $_[0] . '.fasta' }

    Which produced these files:

    $ cat pm_1193237_output_qwerty.fasta >qwerty|111|222|999 AAAAAAAAAA >qwerty|111|222|555 AAAAAAAAAA $ cat pm_1193237_output_asdfgh.fasta >asdfgh|333|444|888 CCCCCCCCCC >asdfgh|333|444|444 CCCCCCCCCC $ cat pm_1193237_output_zxcvbn.fasta >zxcvbn|555|666|777 GGGGGGGGGG >zxcvbn|555|666|333 GGGGGGGGGG $ cat pm_1193237_output_plokmi.fasta >plokmi|777|888|666 TTTTTTTTTT >plokmi|777|888|222 TTTTTTTTTT

    — Ken

      Thank you for the advice I appreciate it.

      The file I am dealing with contains ~140k sequences, so I've created a dummy file to test the script on containing about 20 sequences.

      Pete.