in reply to Skript help needed - RegEx & Hashes

Thank you all so, so much for your feedback and help; you are my heroes right now. The errors pointed out by poj, hippo and haukex solved the issue: when I made just those quick changes in my old script (the variable change, where I still can't believe I overlooked that I had used the wrong variable all along, and the append mode), it worked wonders. I get what I want and I could cry tears of happiness right now. You guys cannot imagine the relief I am feeling.

However, I no longer just want a working script, but one that also looks good. For that I would love to implement all the changes that have been suggested, but I am running into several problems while doing so.

1) When I try to declare the variables where I need them, for some reason I still get the following error: Global symbol "VARIABLE" requires explicit package name (did you forget to declare "my VARIABLE"?)

2) While trying to use the newer suggested way of opening files I get the following warning: Scalar found where operator expected at new_reads.pl line 24, near """$folder" (Missing operator before $folder?)

I don't really understand why these errors are occurring, but I would love to get it fixed simply because I want my script not to be an "outdated" thing, using old ways of handling files and variables lol.

I copied the "newer" version of the script below and once again, thank all of you for your help.

#!/usr/bin/perl -w
use strict;
use warnings;

#Initiate all variables, hashes and co

#Open folders in working directory
my @folders = glob("*"); #to get all folders in directory; extension ("*") as wildcard to get all names

foreach my $folder(@folders) #to speak to each element in directory
{
    next if ($folder!~/^UNITAS_/); #skip elements which do not start with "UNITAS"
    opendir(DIR,$folder)||die print$!; #open folder, end script when opening is not possible (DIR is the "filehandle" for the directory)
    print"\n$folder";
    while( my $file=readdir(DIR)) #returns content of folder
    {
        next if($file!~/\.mapped_sequences$/); #get the mapped_sequences file we need to read out the reads
        print"\n$file"; #print out file names to make sure we get the right files
        my $reads = 0; #set the number of reads to 0 for each run
        open my $fileone, '<', "$folder/$file" or die ""$folder/$file": $!";
        while(my $tocount=<$fileone>)#read file
        {
            chomp $tocount;
            $tocount =~ s/>//g; #remove all ">"
            next if ($tocount =~ /[A-Za-z]/); #skip lines which contain the sequence
            if ($tocount =~ /[0-9]/) #get the read-number
            {
                print"\n$tocount";
                $reads = ($reads + $tocount); # add up all reads
            }
            print"\n$reads";
        }
        close $fileone;

        my $trftable = 'unitas.tRF-table.txt'; #save file name in variable
        open my $trf, '<', "$folder/$trftable" or die ""$folder/$trftable": $!";
        undef = <$trf> for 1 .. 4;
        my %hash = (); #initiate empty hash
        while( my $line=<$trf>)
        {
            chomp $line;
            my @line=split("\t",$line);
            if($line[0]=~s/tRNA-[^-]+-...//) # "tRNA-" (matches tRNA and -), "[^-]+" (anything up to the next dash), "-..." (another dash plus three characters)
            {
                my $tRNAname=$line[0];
                $tRNAname=$&; # "$&" = last pattern match
                print"\n$tRNAname";
            }
            else
            {
                my $tRNAname=$line[0];
                $tRNAname=~s/-ENS.+$//; # "-ENS.+$" (matches everything from -ENS to the end)
                print"\n$tRNAname";
            }
            my $hash{$tRNAname}{"5p-tR-halves"}+=$line[1]/$reads*1000000;
            $hash{$tRNAname}{"5p-tRFs"}+=$line[3]/$reads*1000000;
            $hash{$tRNAname}{"3p-tR-halves"}+=$line[5]/$reads*1000000;
            $hash{$tRNAname}{"3p-CCA-tRFs"}+=$line[7]/$reads*1000000;
            $hash{$tRNAname}{"3p-tRFs"}+=$line[9]/$reads*1000000;
            $hash{$tRNAname}{"tRF-1"}+=$line[11]/$reads*1000000;
            $hash{$tRNAname}{"tRNA-leader"}+=$line[13]/$reads*1000000;
            $hash{$tRNAname}{"misc-tRFs"}+=$line[15]/$reads*1000000;
        }

        open my $merge,">>","$folder/$merge" or die "Could not open $folder/$merge : $!";
        my @tRF_types=("5p-tR-halves","5p-tRFs","3p-tR-halves","3p-CCA-tRFs","3p-tRFs","tRF-1","tRNA-leader","misc-tRFs");
        foreach $tRNAname(sort{$a cmp $b}keys%hash) #sorts alphabetically by keys
        {
            print MERGE $tRNAname; # print tRNA name
            foreach my $tRF_type(@tRF_types)
            {
                print MERGE"\t$hash{$tRNAname}{$tRF_type}"; # print counts for each tRF type separated by tab
            }
            print MERGE"\n";# print newline
        }
        close TRF;
        close MERGE;
        close DIR;
    }
}

Re^2: Skript help needed - RegEx & Hashes
by poj (Abbot) on Oct 11, 2018 at 18:11 UTC

    A few typos to correct

    line 24 - remove double " 
    #open my $fileone, '<', "$folder/$file" or die ""$folder/$file": $!";
    open my $fileone, '<', "$folder/$file" or die "$folder/$file: $!";
    
    
    line 44 - same
    #open my $trf, '<', "$folder/$trftable" or die ""$folder/$trftable": $!";
    open my $trf, '<', "$folder/$trftable" or die "$folder/$trftable: $!";
    	
    
    line 46 remove undef
    #undef = <$trf> for 1 .. 4;
    <$trf> for 1 .. 4;
    
    line 49 add declare here to expand scope of variable
    my $tRNAname;	
    
    line 57 - remove my
    #my $tRNAname=$line[0];
    $tRNAname=$line[0];
    
    line 63 remove my
    #my $tRNAname=$line[0];
    $tRNAname=$line[0];
    
    line 68 - remove my  ( as %hash declared earlier )
    #my $hash{$tRNAname}{"5p-tR-halves"}+=$line[1]/$reads*1000000;
    $hash{$tRNAname}{"5p-tR-halves"}+=$line[1]/$reads*1000000;
    
    line 69 - change $merge after $folder to merge
    #open my $merge,">>","$folder/$merge" or die "Could not open $folder/$merge : $!";
    open my $merge,">>","$folder/merge" or die "Could not open $folder/merge : $!";
    
    line 84..94 change MERGE to $merge
    print MERGE $tRNAname; # print tRNA name
    foreach my $tRF_type(@tRF_types) 
     {
    	print MERGE"\t$hash{$tRNAname}{$tRF_type}"; # print counts for each tRF type separated by tab
     }
    print MERGE"\n";# print newline
    }
    close TRF;
    close MERGE;
    close DIR;
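
As an illustration of two of the fixes in the list above (declaring $tRNAname before the if/else so that it is still in scope afterwards, and printing to the lexical handle $merge instead of the bareword MERGE), here is a minimal, self-contained sketch. The sample names and the in-memory output string are assumptions made up for the demo, and it uses a capture group rather than the s/// plus $& combination from the original script.

```perl
use strict;
use warnings;

# Hypothetical sample names, only for this demo.
my @names = ('tRNA-Ala-AGC-1-1', 'MT-TM-ENSG00000210112');

# Write into a string instead of a real file so the sketch is self-contained.
my $output = '';
open my $merge, '>', \$output or die "in-memory file: $!";

for my $name (@names) {
    my $tRNAname;                        # declared before the if/else ...
    if ( $name =~ /^(tRNA-[^-]+-...)/ ) {
        $tRNAname = $1;
    }
    else {
        ( $tRNAname = $name ) =~ s/-ENS.+$//;
    }
    print $merge "$tRNAname\n";          # ... so it is still visible here
}
close $merge;

print $output;
```

A `my` inside the braces of the if or else branch would go out of scope at the closing brace, which is what triggers the "Global symbol ... requires explicit package name" message on the later lines.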
    
    poj
      line 24 - remove double "
      #open my $fileone, '<', "$folder/$file" or die ""$folder/$file": $!";

      In case it isn't obvious to anyone reading this, the problem with die ""$folder/$file": $!"; is that there's nothing to say that the second " is a literal character and not simply closing the empty string started by the first " character. Here are some alternatives:

      die "'$folder/$file': $!";      # choose a different literal character
      die "\"$folder/$file\": $!";    # escape the inner "
      die qq{"$folder/$file": $!};    # use an alternative outer delimiter
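
A quick way to convince yourself that these alternatives produce the intended message is to build the strings without calling die. This is only a sketch; the folder name, file name and error text are made-up values for the demo.

```perl
use strict;
use warnings;

# Hypothetical values, only for this demo.
my ( $folder, $file ) = ( 'UNITAS_demo', 'x.mapped_sequences' );
my $err = 'No such file or directory';

my @messages = (
    "'$folder/$file': $err",      # different literal character
    "\"$folder/$file\": $err",    # escaped inner "
    qq{"$folder/$file": $err},    # alternative outer delimiter
);
print "$_\n" for @messages;
```

The second and third forms yield the same string; only the way the double quotes are written in the source differs.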
Re^2: Skript help needed - RegEx & Hashes
by haukex (Archbishop) on Oct 12, 2018 at 11:20 UTC

    Here's how I might have written that script, with the following changes to your version:

    • Formatting: I've used my personal formatting style; a matter of taste of course (and sometimes I even vary my own style, if I think it looks better another way). For example, I added a bunch of whitespace and removed a couple of parens where it's not strictly necessary (but you are free to add parens if you like). I wrapped a few long lines so they would display nicely here, but usually I write my open ... or die ... on one line if it fits reasonably.
    • Don't need both #!/usr/bin/perl -w and use warnings; (What's wrong with -w and $^W)
    • I used the glob suggestion from jwkrahn, and also made sure that it would only return directories with the -d filetest operator. Note that glob has quite a few caveats, but with fixed strings it's ok.
    • I applied most of the changes suggested by poj and others.
    • Just to shorten the code a bit, I used an intermediate hash reference $h for $hash{$tRNAname}, in order for this to work I had to make sure to initialize $hash{$tRNAname} with an empty anonymous hash: my $h = ( $hash{$tRNAname} //= {} ) means "assign {} (an empty anonymous hash) to $hash{$tRNAname} if the latter is not yet defined, then assign the value of $hash{$tRNAname} to $h". (See also perlreftut and perlref.) Update: And see hippo's reply for one way to shorten it even more.
    • You were closing the directory handle too early, and I had to change the scoping of a couple of variables like $tRNAname.
    • I switched to using Data::Dumper to output the variables, which I configured in a way that I like the output better (although normally I'd use Data::Dump; Data::Dumper is a core module). BTW, I'm not sure why you were prefixing the \n in your prints, but normally one would do things like print "$tocount\n";
    • You said open my $merge,">>","$folder/$merge", but the latter variable doesn't yet exist at that point (my $merge doesn't take effect until after the open statement), and in your original script you said open(MERGE,">merge"), so I'm not sure if you want a merge file per folder, or a single merge file in the current working directory? If it's the latter, then hippo's suggestion of opening the file once at the top of the script is probably better; then you also don't have to use append mode.
    • I'm not sure about if ( $tocount =~ /[0-9]/ ): If you want to make sure that it contains only digits, you should anchor your regex, as in /^[0-9]+$/.
    • Plus I made a few other tweaks and used idioms in a few places, such as ( $tRNAname = $line[0] ) =~ s/-ENS.+$//, which means "copy $line[0] to $tRNAname and then apply the regex to $tRNAname".
    • {$a cmp $b} is the default sort order and isn't really needed, unless you really want to be explicit (it doesn't hurt).
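
Three of the idioms from the list above can be tried in isolation. The following is a minimal sketch, assuming Perl 5.10 or later (for the //= operator); the key names and sample strings are made up for the demo.

```perl
use strict;
use warnings;

my %hash;

# $h refers to the same anonymous hash that is stored in %hash,
# so updates made through $h are visible via $hash{...} as well.
my $h = ( $hash{'tRNA-Ala-AGC'} //= {} );
$h->{'5p-tRFs'} += 2;
$h->{'5p-tRFs'} += 3;
print "5p-tRFs: $hash{'tRNA-Ala-AGC'}{'5p-tRFs'}\n";

# Copy first, then substitute on the copy; the original stays untouched.
my $orig = 'MT-TM-ENSG00000210112';    # hypothetical name
( my $copy = $orig ) =~ s/-ENS.+$//;
print "$orig -> $copy\n";

# Unanchored versus anchored digit check.
for my $s ( '42', 'read_42' ) {
    my $has_digit   = $s =~ /[0-9]/    ? 'yes' : 'no';
    my $only_digits = $s =~ /^[0-9]+$/ ? 'yes' : 'no';
    print "$s: contains a digit: $has_digit, only digits: $only_digits\n";
}
```

The unanchored pattern accepts any string with at least one digit somewhere in it, while the anchored one requires the whole string to consist of digits.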

    Please have a look, and if you have any questions, please let us know.

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use Data::Dumper;
    $Data::Dumper::Useqq = 1;
    $Data::Dumper::Quotekeys = 0;
    $Data::Dumper::Sortkeys = 1;

    for my $folder ( grep {-d} glob('UNITAS_*') ) {
        print Data::Dumper->Dump([$folder], [qw/folder/]);
        opendir my $dh, $folder or die "$folder: $!";
        while ( my $file = readdir($dh) ) {
            next if $file !~ /\.mapped_sequences$/;
            print Data::Dumper->Dump([$file], [qw/file/]);

            my $reads = 0;
            open my $fileone, '<', "$folder/$file"
                or die "$folder/$file: $!";
            while ( my $tocount = <$fileone> ) {
                chomp $tocount;
                $tocount =~ s/>//g;
                next if $tocount =~ /[A-Za-z]/;
                if ( $tocount =~ /[0-9]/ ) {
                    print Data::Dumper->Dump([$tocount], [qw/tocount/]);
                    $reads += $tocount;
                }
            }
            close $fileone;
            print Data::Dumper->Dump([$reads], [qw/reads/]);

            my %hash;
            my $trftable = 'unitas.tRF-table.txt';
            open my $trf, '<', "$folder/$trftable"
                or die "$folder/$trftable: $!";
            <$trf> for 1 .. 4;
            while ( my $line = <$trf> ) {
                chomp $line;
                my @line = split /\t/, $line;
                #print Data::Dumper->Dump([\@line], [qw/*line/]);
                my $tRNAname;
                if ( $line[0] =~ s/tRNA-[^-]+-...// )
                    { $tRNAname = $& }
                else
                    { ( $tRNAname = $line[0] ) =~ s/-ENS.+$// }
                print Data::Dumper->Dump([$tRNAname], [qw/tRNAname/]);
                my $h = ( $hash{$tRNAname} //= {} );
                $h->{"5p-tR-halves"} += $line[ 1] / $reads * 1000000;
                $h->{"5p-tRFs"}      += $line[ 3] / $reads * 1000000;
                $h->{"3p-tR-halves"} += $line[ 5] / $reads * 1000000;
                $h->{"3p-CCA-tRFs"}  += $line[ 7] / $reads * 1000000;
                $h->{"3p-tRFs"}      += $line[ 9] / $reads * 1000000;
                $h->{"tRF-1"}        += $line[11] / $reads * 1000000;
                $h->{"tRNA-leader"}  += $line[13] / $reads * 1000000;
                $h->{"misc-tRFs"}    += $line[15] / $reads * 1000000;
            }
            close $trf;
            print Data::Dumper->Dump([\%hash], [qw/*hash/]);

            open my $merge, '>>', "$folder/merge"
                or die "$folder/merge: $!";
            my @tRF_types = ("5p-tR-halves", "5p-tRFs", "3p-tR-halves",
                "3p-CCA-tRFs", "3p-tRFs", "tRF-1", "tRNA-leader", "misc-tRFs");
            for my $tRNAname ( sort keys %hash ) {
                print $merge $tRNAname;
                for my $tRF_type (@tRF_types) {
                    print $merge "\t$hash{$tRNAname}{$tRF_type}";
                }
                print $merge "\n";
            }
            close $merge;
        }
        close $dh;
    }

    For the sample data from this post, the output file merge I get is the following. Note that if you re-run the script, because of the append mode on the merge file, the same lines get added to that file again.

    MT-TM         6500000  4500000           0  0                 0                 0         0        20416666.6666666
    MT-TN         0        500000            0  0                 0                 750000    1000000  12625000
    MT-TP         0        0                 0  0                 0                 1000000   0        0
    tRNA-Ala-AGC  0        3863095.23809524  0  0                 136363.636363636  0         0        23353306.8783069
    tRNA-Ala-CGC  0        8708333.33333333  0  0                 0                 14500000  500000   8980291.00529101
    tRNA-Ala-TGC  0        11833333.3333333  0  0                 90909.0909090909  0         0        84521296.2962963
    tRNA-Arg-ACG  0        100000            0  71428.5714285715  0                 6500000   0        4916666.66666667
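
The effect of append mode on repeated runs can be reproduced in a few lines. This is only a sketch under assumptions: a temporary file from File::Temp stands in for the merge file, and the helpers run_once and count_lines are hypothetical names made up for the demo.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Temporary file standing in for the merge file (hypothetical demo data).
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
close $tmp;

# Simulate one run of the script with the given open mode.
sub run_once {
    my ($mode) = @_;
    open my $out, $mode, $path or die "$path: $!";
    print $out "MT-TP\t0\n";
    close $out;
}

# Count the lines currently in the file.
sub count_lines {
    open my $in, '<', $path or die "$path: $!";
    my @lines = <$in>;
    close $in;
    return scalar @lines;
}

run_once('>>') for 1 .. 2;
my $after_append = count_lines();    # lines accumulate across runs
run_once('>') for 1 .. 2;
my $after_write = count_lines();     # each run truncates the file first
print "append: $after_append lines, write: $after_write lines\n";
```

With '>>' every run adds its lines to whatever is already in the file, whereas '>' starts from an empty file each time.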

    Update: Minor edits and a few additions to the explanations.

      $h->{"5p-tR-halves"} += $line[ 1] / $reads * 1000000;
      $h->{"5p-tRFs"}      += $line[ 3] / $reads * 1000000;
      $h->{"3p-tR-halves"} += $line[ 5] / $reads * 1000000;
      $h->{"3p-CCA-tRFs"}  += $line[ 7] / $reads * 1000000;
      $h->{"3p-tRFs"}      += $line[ 9] / $reads * 1000000;
      $h->{"tRF-1"}        += $line[11] / $reads * 1000000;
      $h->{"tRNA-leader"}  += $line[13] / $reads * 1000000;
      $h->{"misc-tRFs"}    += $line[15] / $reads * 1000000;

      Years ago that would not have bothered me but nowadays it makes me twitch. YMMV but for DRY:

      my $i = 1;
      for (qw/5p-tR-halves 5p-tRFs 3p-tR-halves 3p-CCA-tRFs 3p-tRFs tRF-1 tRNA-leader misc-tRFs/) {
          $h->{$_} += $line[$i] / $reads * 1000000;
          $i += 2; # Odd entries only
      }

        I thought about doing something like that, but ended up leaving it out - I should've added it, since I agree with you!

Re^2: Skript help needed - RegEx & Hashes
by jwkrahn (Abbot) on Oct 11, 2018 at 18:25 UTC
    undef = <$trf> for 1 .. 4;
    my %hash = (); #initiate empty hash
    while( my $line=<$trf>)
    {
        chomp $line;
        my @line=split("\t",$line);

    Another way to skip the first four lines:

    my %hash = (); #initiate empty hash
    while ( my $line = <$trf> ) {
        next if $. < 5;    # skip first four lines
        chomp $line;
        my @line = split /\t/, $line;
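
To try the $. approach without the real input files, the loop can be run against an in-memory filehandle. This is a minimal sketch; the table contents are invented for the demo (four header lines followed by two data lines).

```perl
use strict;
use warnings;

# Hypothetical table: four header lines followed by two data lines.
my $table = join '', map { "header $_\n" } 1 .. 4;
$table .= "tRNA-Ala-AGC\t1\nMT-TM\t2\n";

open my $trf, '<', \$table or die "in-memory file: $!";
my @names;
while ( my $line = <$trf> ) {
    next if $. < 5;    # $. is the line number of the last handle read
    chomp $line;
    my @fields = split /\t/, $line;
    push @names, $fields[0];
}
close $trf;

print "$_\n" for @names;
```

Because $. is reset when the handle is closed, the check works again for the next file as long as each handle is closed before the next one is opened.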