
Hmm, but how would I integrate the split function into my loop in order to name the file I create for each sequence?

Re^3: An overlapping regex capture
by Discipulus (Canon) on Jun 21, 2017 at 20:49 UTC
    > how would I integrate the split function into my loop

    Since you have two loops, you can do the split either when you grab $id or just before opening the filehandle.

    But looking more closely at your code, there are several things wrong:

    my %id2seq = (); # this is the verbose form of my %id2seq;
    my $id = '';     # this is the wrong place to declare this var! declare it when you need it, i.e.
                     # inside the while(<File>){ block

    # missing the mode: always put it, even if it defaults to '<'
    open File,"human_hg19_circRNAs_putative_spliced_sequence.fa",or die $!;
    # better to use a lexical filehandle, as in: open my $fh, '<', $filepath or die
    # a bareword is still accepted but by convention is UPPERCASE, so no open File...

    while(<File>){
        chomp;
        # here you are capturing something: if you want just the part before | you
        # have the possibility to get it here: /^>([\w\d]+\|)/ as a starting option?
        if($_ =~ /^>(.+)/){
            # or here:
            $id = $1; # cutting $1 like in: $id = (split /\|/, $1)[0]
    ...

    # AHHH! this is an error! are your use strict; use warnings just make-up?
    # it must be foreach my $id ..
    # (or does it really not raise a warning for the scope you gave to $id ??? if so it is even
    # worse!!)
    # in short: pay attention to the scope of your variables
    foreach $id (keys %id2seq){
        # here is the last good possibility to cut $id:
        # $id = (split /\|/, $id)[0];
        if (-f $id){
            # this is a lie..
            print $id." Already exists, about to override it","\n"
        }
        # .. because you are going to append, not to overwrite
        open my $out_fh, '>>', "$id.fa" or die $!;
        # here parens are unneeded and probably nasty
        print $out_fh (">".$id."\n",$id2seq{$id}, "\n");
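To make the append-versus-overwrite remark concrete, here is a minimal sketch that is not taken from the original script. It writes to a throwaway file created with File::Temp (the name demo.fa is a made-up placeholder) and counts the lines afterwards: '>' truncates on every open, while '>>' adds to whatever is already there.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Throwaway directory, removed automatically when the script exits.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/demo.fa";    # hypothetical file name, for illustration only

# Open the same file three times: overwrite, overwrite again, then append.
for my $mode ('>', '>', '>>') {
    open my $out_fh, $mode, $path or die "Could not open $path: $!";
    print $out_fh "line\n";
    close $out_fh or die "Could not close $path: $!";
}

# Read the file back and count what survived.
open my $in_fh, '<', $path or die "Could not open $path: $!";
my @lines = <$in_fh>;
close $in_fh;

print scalar(@lines), " line(s) in the file\n";
```

With these three opens the file should end up holding two lines: the second '>' discards the first write, and the '>>' then appends to it. So a script that opens with '>>' keeps growing its output files on every run.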

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      But the thing is, I want to be able to capture the full fasta sequence title (>hsa_circ_0000001|chr1:1080738-1080845-|None|None) along with the sequence and print that into a file and I want to capture a part of the sequence title and use it to name the file (hsa_circ_0000001) excluding '>'.

      If I use your method surely I will not be able to print the full fasta sequence title followed by a newline and then the sequence in my newly created file, because $id will become (hsa_circ_0000001), whereas I need to be able to capture both (hsa_circ_0000001) and (>hsa_circ_0000001|chr1:1080738-1080845-|None|None) and apply them seamlessly as the new files are created in my loop.

      Pete.

        If so, create a new variable for the filename just before creating the file, as in:

        foreach $id (keys %id2seq){
            # here the last good possibility to cut $id:
            my $filename = (split /\|/, $id)[0];
            ...
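A minimal sketch of that idea, under stated assumptions: the hash holds a single record whose key is the example header quoted earlier in this thread, and the sequence string is made up. Instead of opening files, the loop collects what would be written in a hypothetical %planned hash, so that both values are visible side by side.

```perl
use strict;
use warnings;

# Hypothetical one-record hash: the header is the example from this thread,
# the sequence is invented for illustration.
my %id2seq = (
    'hsa_circ_0000001|chr1:1080738-1080845-|None|None' => 'ACGTACGT',
);

my %planned;    # filename => text that would be written to that file
foreach my $id (keys %id2seq) {
    # $id is left untouched; only the new variable holds the shortened name.
    my $filename = (split /\|/, $id)[0] . '.fa';
    $planned{$filename} = ">$id\n$id2seq{$id}\n";
}

print "$_:\n$planned{$_}" for sort keys %planned;
```

Because $id is never modified, the full header is still available for the ">" line, while $filename carries only the part before the first "|".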

        L*

      Okay, I fixed up the items you suggested I should, and so my script is looking like this:
      #!/usr/bin/perl
      use strict;
      use warnings;

      open my $fh, '<',"human_hg19_circRNAs_putative_spliced_sequence.fa",or die $!;
      my %id2seq;
      while(<$fh>){
          my $id = '';
          chomp;
          if($_ =~ /^>(.+)/){
              $id = $1;
          }else{
              $id2seq{$id} .= $_;
          }
      }
      foreach my $id (keys %id2seq){
          my $filename = (split /\|/, $id)[0];
          open my $out_fh, '>>', "$filename" or die $!;
          print $out_fh ">".$id."\n",$id2seq{$id}, "\n";
          close $out_fh;
      }
      close $fh;

      How do I integrate the value I've split and extracted into the naming of the file, because it's stating that it's uninitialised?

      Although, I thought that it was clearly initialised/defined here:

      my $filename = (split /\|/, $id)[0];
      open my $out_fh, '>>', "$filename" or die $!;

      Or maybe I'm just misunderstanding the scope? Where do I place the $filename in the loop?

      Pete.

        The problem is earlier: $id is set on the > lines but then cleared again on the following sequence lines, because my $id = '' sits inside the while loop and runs for every line read.

        while(<$fh>){
            my $id = '';
            chomp;
            if ($_ =~ /^>(.+)/){
                $id = $1;
            } else {
                $id2seq{$id} .= $_;
            }
        }
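The effect can be reproduced without the real input file. The following is a small sketch, assuming two made-up FASTA records held in an array: it runs the same loop body once with $id declared inside the loop and once with it declared before the loop, then shows what split returns for the empty key.

```perl
use strict;
use warnings;

# Two invented FASTA records, for illustration only.
my @lines = (">seq1|x\n", "ACGT\n", ">seq2|y\n", "TTGA\n");

# Variant 1: $id is declared inside the loop, so it is '' again on every line.
my %inside;
for my $line (@lines) {
    my $id = '';
    chomp(my $text = $line);
    if ($text =~ /^>(.+)/) { $id = $1 }
    else                   { $inside{$id} .= $text }
}

# Variant 2: $id is declared before the loop, so it survives until the next header.
my %outside;
my $id = '';
for my $line (@lines) {
    chomp(my $text = $line);
    if ($text =~ /^>(.+)/) { $id = $1 }
    else                   { $outside{$id} .= $text }
}

print "inside:  ", join(', ', map { "'$_'" } sort keys %inside),  "\n";
print "outside: ", join(', ', map { "'$_'" } sort keys %outside), "\n";

# Splitting an empty string yields an empty list, so element [0] is undef.
my $filename = (split /\|/, '')[0];
print "filename is ", (defined $filename ? "defined" : "undefined"), "\n";
```

In the first variant every sequence line is stored under the empty-string key, and splitting that key gives no fields at all, which is where an undefined $filename comes from. In the second variant each header keeps its own entry.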

        Try

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $id;
        my %id2seq;
        my $infile = 'human_hg19_circRNAs_putative_spliced_sequence.fa';
        open my $fh,'<',$infile or die "Could not open $infile : $!";
        while (<$fh>){
            if ( /^>(.+)/ ){
                $id = (split /\|/, $1)[0];
            }
            $id2seq{$id} .= $_;
        }

        foreach my $id (keys %id2seq){
            my $filename = $id.'.fa';
            print "Creating $filename\n";
            open my $out_fh,'>', $filename or die "Could not open $filename : $!";
            print $out_fh $id2seq{$id};
            close $out_fh;
        }
        close $fh;
        poj

        Did you try printing the values of your variables as I suggested?


        The way forward always starts with a minimal test.

      Okay, so I tried to implement that as so:

      foreach $id (keys %id2seq){
          my $filename = (split /\|/, $id)[0];
          open my $out_fh, '>>', "$filename.fa" or die $!;
          print $out_fh (">".$id."\n",$id2seq{$id}, "\n");
          close $out_fh;
      }
      But I got an error saying that $filename was uninitialized on the 3rd line down:

      open my $out_fh, '>>', "$filename.fa" or die $!;

      So I'm just trying to figure out where I should be declaring $filename.

      Pete.

        First, follow the advice given. Go through your script and correct *all* the items Discipulus pointed out. Second, when in doubt, print out the value of your variables.

        use Data::Dumper;
        ...
        foreach my $id (keys %id2seq) {
            warn "ID: $id";
            my @segments = split /\|/, $id;
            warn "Segments: " . Dumper \@segments;
            my $filename = $segments[0];
            ...
        }
        Once the code is working right, remove the debug statements.


        No, you got the error saying that $filename.fa is uninitialized.

        So perl saw that as a variable name while you intended to only use $filename instead.

        In order to delimit the variable name you can use this syntax: "${filename}.fa".

        Good luck, hexcoder
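A quick way to check how interpolation treats the dot is sketched below, using the example ID from this thread as an assumed value. As far as I know, "." cannot be part of a Perl variable name, so "$filename.fa" and "${filename}.fa" should produce the same string. The braces start to matter when the next character could continue an identifier, such as "_" (the _v2 suffix here is made up).

```perl
use strict;
use warnings;

my $filename = 'hsa_circ_0000001';    # example ID taken from this thread

my $plain  = "$filename.fa";      # the '.' ends the variable name
my $braced = "${filename}.fa";    # explicit delimiting, same string here

# Braces are needed when the following character could continue an identifier:
# without them Perl would look for a variable called $filename_v2.
my $suffixed = "${filename}_v2.fa";

print "$plain\n$braced\n$suffixed\n";
```

If both spellings give the same result, a warning that names $filename on that line suggests the variable itself was undefined at that point, so printing it before the open is a reasonable next step.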

      Thank you for this advice, it's golden. I will try to implement all your points today.

      Pete.