in reply to Re^2: An overlapping regex capture
in thread An overlapping regex capture

> how would I integrate the split function into my loop

Since you have two loops, you can integrate it either when you grab $id or just before opening the filehandle.

But looking more closely at your code, there are several things wrong:

my %id2seq = ();   # this is the verbose form of my %id2seq;
my $id = '';       # this is the wrong place to declare this var! declare it when you need it, ie
                   # inside the while(<File>){ block

# missing the mode: put always even if it defaults to '<'
open File,"human_hg19_circRNAs_putative_spliced_sequence.fa",or die $!;
# better use lexical filehandle like in open my $fh, '<', $filepath or die
# bareword is still accepted but by convention is UPPERCASE so no open File...

while(<File>){
    chomp;
    # here you are capturing something: if you want just the part before | you
    # have here the possibility to get it: /^>([\w\d]+\|)/ as starting option?
    if($_ =~ /^>(.+)/){
        # or here:
        $id = $1;
        # cutting $1 like in: $id = (split /\|/, $1)[0]
...

# AHHH! this is error! are your use strict; use warnings just make-up?
# it must be foreach my $id ..
# (or really does not raise a warn for the scope you given to $id ??? if so is even
# worst!!)
# in short: pay attention to the scope of your variables
foreach $id (keys %id2seq){
    # here the last good possibility to cut $id:
    # $id = (split /\|/, $id)[0];
    if (-f $id){
        # this is a lie..
        print $id." Already exists, about to override it","\n"
    }
    # .. because you are going to append, not to overwrite
    open my $out_fh, '>>', "$id.fa" or die $!;
    # here parens are unneeded and probably nasty
    print $out_fh (">".$id."\n",$id2seq{$id}, "\n");
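
The append-versus-overwrite remark can be checked with a small sketch. It writes into a temporary directory created with the core File::Temp module; the file name demo.fa and the helper write_with_mode are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write $text to $path using the given open mode
# ('>' truncates the file first, '>>' appends to whatever is there).
# Returns the file size in bytes after writing.
sub write_with_mode {
    my ( $mode, $path, $text ) = @_;
    open my $out_fh, $mode, $path or die "Could not open $path: $!";
    print $out_fh $text;
    close $out_fh or die "Could not close $path: $!";
    return -s $path;
}

my $dir  = tempdir( CLEANUP => 1 );    # removed automatically at exit
my $path = "$dir/demo.fa";

my $first    = write_with_mode( '>',  $path, "ACGT\n" );
my $appended = write_with_mode( '>>', $path, "ACGT\n" );
my $replaced = write_with_mode( '>',  $path, "ACGT\n" );

print "after '>':       $first bytes\n";
print "after '>>':      $appended bytes\n";
print "after '>' again: $replaced bytes\n";
```

With '>>' the second write grows the file, while '>' brings it back to the size of a single write, so a message that promises an override only matches the '>' mode.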

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Re^4: An overlapping regex capture
by Peter Keystrokes (Beadle) on Jun 22, 2017 at 11:52 UTC
    Okay, I fixed up the items you suggested I should, and so my script is looking like this:
    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $fh, '<',"human_hg19_circRNAs_putative_spliced_sequence.fa",or die $!;
    my %id2seq;
    while(<$fh>){
        my $id = '';
        chomp;
        if($_ =~ /^>(.+)/){
            $id = $1;
        }else{
            $id2seq{$id} .= $_;
        }
    }
    foreach my $id (keys %id2seq){
        my $filename = (split /\|/, $id)[0];
        open my $out_fh, '>>', "$filename" or die $!;
        print $out_fh ">".$id."\n",$id2seq{$id}, "\n";
        close $out_fh;
    }
    close $fh;

    How do I integrate the value I've split and extracted into the naming of the file, because it's stating that it's uninitialised?

    Although, I thought that it was clearly initialised/defined here:

    my $filename = (split /\|/, $id)[0];
    open my $out_fh, '>>', "$filename" or die $!;

    Or maybe I'm just misunderstanding the scope? Where do I place the $filename in the loop?

    Pete.

      The problem is earlier: $id is set on the > lines, but because my $id = '' sits inside the while loop, it is reset to an empty string on every subsequent sequence line.

      while(<$fh>){
          my $id = '';
          chomp;
          if ($_ =~ /^>(.+)/){
              $id = $1;
          } else {
              $id2seq{$id} .= $_;
          }
      }

      Try

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $id;
      my %id2seq;
      my $infile = 'human_hg19_circRNAs_putative_spliced_sequence.fa';
      open my $fh,'<',$infile or die "Could not open $infile : $!";
      while (<$fh>){
          if ( /^>(.+)/ ){
              $id = (split /\|/, $1)[0];
          }
          $id2seq{$id} .= $_;
      }
      foreach my $id (keys %id2seq){
          my $filename = $id.'.fa';
          print "Creating $filename\n";
          open my $out_fh,'>', $filename or die "Could not open $filename : $!";
          print $out_fh $id2seq{$id};
          close $out_fh;
      }
      close $fh;
      poj

      Did you try printing the values of your variables as I suggested?


      The way forward always starts with a minimal test.

        Yes, I tried it and got the following:

        ID: at seqextractor.pl line 23, <$fh> line 130.

        Segments: $VAR1 = [];

        From using the following:

        foreach my $id (keys %id2seq){
            warn "ID: $id";
            my @segments = split /\|/, $id;
            warn "Segments: " . Dumper \@segments;
            my $filename = $segments[0];
        }

        Does it mean that the array is empty?
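
The empty value after "ID:" in that output suggests the hash key itself is an empty string, and split on an empty string returns an empty list, which would leave $segments[0] undefined. A minimal sketch of that behavior follows; the helper name first_segment is hypothetical and used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the part of a header before the first '|'.
# Returns undef for an empty string, because split on an empty string
# produces zero fields.
sub first_segment {
    my ($header) = @_;
    my @segments = split /\|/, $header;
    return $segments[0];
}

my $good  = first_segment('hsa_circ_0000001|chr1:1080738-1080845-|None|None');
my $empty = first_segment('');

print "good:  ", ( defined($good)  ? $good  : 'undef' ), "\n";
print "empty: ", ( defined($empty) ? $empty : 'undef' ), "\n";
```

If the key is empty, the fix belongs where the key is built (the declaration of $id inside the while loop), not in the split.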

Re^4: An overlapping regex capture
by Peter Keystrokes (Beadle) on Jun 21, 2017 at 21:11 UTC
    But the thing is, I want to be able to capture the full fasta sequence title (>hsa_circ_0000001|chr1:1080738-1080845-|None|None) along with the sequence and print that into a file and I want to capture a part of the sequence title and use it to name the file (hsa_circ_0000001) excluding '>'.

    If I use your method surely I will not be able to print the full fasta sequence title followed by a newline and then the sequence in my newly created file, because $id will become (hsa_circ_0000001), whereas I need to be able to capture both (hsa_circ_0000001) and (>hsa_circ_0000001|chr1:1080738-1080845-|None|None) and apply them seamlessly as the new files are created in my loop.

    Pete.

      If so, create a new variable for the filename just before creating the file, as in:

      foreach $id (keys %id2seq){
          # here the last good possibility to cut $id:
          my $filename = (split /\|/, $id)[0];
          ...
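
A sketch of the whole flow under these assumptions: the full header stays as the hash key, and the filename is derived from it only at output time. It reads from an in-memory string rather than the real file, prints instead of writing files, and the sub names read_fasta and filename_for are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a filehandle into a hash
# keyed by the full header line (without the leading '>').
sub read_fasta {
    my ($fh) = @_;
    my %id2seq;
    my $id;    # declared outside the loop so it survives across lines
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^>(.+)/ ) {
            $id = $1;
        }
        elsif ( defined $id ) {
            $id2seq{$id} .= $line;
        }
    }
    return \%id2seq;
}

# Hypothetical helper: the file name is the header text before the
# first '|', with '.fa' appended.
sub filename_for {
    my ($id) = @_;
    my ($first) = split /\|/, $id;
    return "$first.fa";
}

# Illustrative in-memory input; the sequence letters are made up.
my $fasta = ">hsa_circ_0000001|chr1:1080738-1080845-|None|None\nACGT\nTTGA\n";
open my $in, '<', \$fasta or die "Could not open in-memory file: $!";
my $id2seq = read_fasta($in);
close $in;

for my $id ( sort keys %$id2seq ) {
    # The full header is still available for the record itself.
    print filename_for($id), " would contain:\n>$id\n$id2seq->{$id}\n";
}
```

Keeping the two values separate means the full title line can still be printed into the file while only the short identifier is used for its name.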

      L*

Re^4: An overlapping regex capture
by Peter Keystrokes (Beadle) on Jun 22, 2017 at 09:54 UTC

    Thank you for this advice, it's golden. I will try to implement all your points today.

    Pete.

Re^4: An overlapping regex capture
by Peter Keystrokes (Beadle) on Jun 21, 2017 at 21:47 UTC

    Okay, so I tried to implement that as so:

    foreach $id (keys %id2seq){
        my $filename = (split /\|/, $id)[0];
        open my $out_fh, '>>', "$filename.fa" or die $!;
        print $out_fh (">".$id."\n",$id2seq{$id}, "\n");
        close $out_fh;
    }
    But I got an error saying that $filename was uninitialized at the 3rd line down ---  open  my $out_fh, '>>', "$filename.fa" or die $!;

    So I'm just trying to figure out where I should be declaring  $filename

    Pete.

      First, follow the advice given. Go through your script and correct *all* the items Discipulus pointed out. Second, when in doubt, print out the value of your variables.

      use Data::Dumper;
      ...
      foreach my $id (keys %id2seq) {
          warn "ID: $id";
          my @segments = split /\|/, $id;
          warn "Segments: " . Dumper \@segments;
          my $filename = $segments[0];
          ...
      }
      Once the code is working right, remove the debug statements.


      The way forward always starts with a minimal test.
      No, you got the error saying that $filename.fa is uninitialized.

      So perl saw that as a variable name while you intended to only use $filename instead.

      In order to delimit the variable name you can use this syntax: "${filename}.fa".

      Good luck, hexcoder
        No, you got the error saying that $filename.fa is uninitialized. So perl saw that as a variable name ...

        Sorry, but that's not correct in this case: a dot does end the variable name being interpolated. And since the OP is using strict, it would have caught such an error anyway. Your advice does apply to other characters, though.

        $ perl -wMstrict -le 'my $fn="x"; print "$fn.y"'
        x.y
        $ perl -wMstrict -le 'my $fn="x"; print "$fn_y"'
        Global symbol "$fn_y" requires explicit package name (did you forget to declare "my $fn_y"?) at -e line 1.
        Execution of -e aborted due to compilation errors.
        $ perl -w -le 'my $fn="x"; print "$fn_y"'
        Name "main::fn_y" used only once: possible typo at -e line 1.
        Use of uninitialized value $fn_y in string at -e line 1.
        $ perl -wMstrict -le 'my $fn="x"; print "${fn}_y"'
        x_y