Skript help needed - RegEx & Hashes

PandaRaey has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Skript help needed - RegEx & Hashes by haukex (Archbishop) on Oct 10, 2018 at 11:00 UTC
Welcome to Perl and PerlMonks, PandaRaey! First, a few general tips:

- It's very good that you're using strict and warnings! However, note that pre-declaring your variables at the top of the script like that is not so good, because they then behave like global variables. Instead, it's best to declare them in the smallest scope where they are needed. For example, `foreach my $folder (@folders) {`, `while ( my $file = readdir(DIR) ) {`, or `my $reads = 0;`. Skimming your code, I don't see any obvious scoping issues caused by the global variables, but I might have overlooked something.
- ~~Please try to use consistent indentation.~~ perltidy can help.
- I would recommend using the more modern three-argument form of open, and lexical filehandles (`my $fh` instead of bareword filehandles like `FILE`, since the latter are global). For example: `open my $fh, '<', $filename or die "$filename: $!";`
- Always check open for errors, which you do on your first open, but not on the second or third.

The last point may even account for your problem, "I do not get the content I should printed into the Merge-File". Another possibility is that you might want to open that file for appending (`">>"`), because `">"` overwrites the file (Update: I just saw that hippo made the same point.).

Other than that, I have looked at your code, and nothing obvious has jumped out at me yet. A few thoughts: You don't seem to be chomping the lines you read from files (removing the newline at the end), and you could use some of the tips from the Basic debugging checklist, like using Data::Dumper to print out the contents of your variables while your program is running (I recommend setting `$Data::Dumper::Useqq=1;` to see whitespace better). Also, I see you're doing `$tRNAname = $line[0]; $tRNAname = $&;`, which doesn't make sense to me because the second assignment will just overwrite the first. Other than that, your expected output seems to depend on your data and algorithm.

The problem is that without sample data, we can't really run the code. It would be best if you could provide a Short, Self-Contained, Correct Example, that is, some small sample input that demonstrates your problem, the expected output for that input, and your actual output, including any error messages you might be getting (all within `<code>` tags). A description of what your algorithm is supposed to be doing would also help.
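The tips above (lexical filehandle, three-argument open with an error check, chomp, and Data::Dumper with `Useqq`) can be sketched together as follows. This is a minimal sketch, not the original poster's code: the temporary file and its two lines are made up here only so the example runs on its own.

```perl
use strict;
use warnings;
use Data::Dumper;
use File::Temp qw(tempfile);

$Data::Dumper::Useqq = 1;    # print "\n", "\t" etc. as visible escapes

# Hypothetical sample file, created only so that this sketch is runnable.
my ( $tmp_fh, $filename ) = tempfile( UNLINK => 1 );
print {$tmp_fh} ">1\nCCTCCTCTACCTCATCCCAGTT\n";
close $tmp_fh or die "$filename: $!";

# Three-argument open with a lexical filehandle, checked for errors.
open my $fh, '<', $filename or die "$filename: $!";

my @lines;
while ( my $line = <$fh> ) {    # $line only exists inside this loop
    print Dumper($line);        # before chomp: the trailing "\n" is visible
    chomp $line;                # remove the newline at the end
    push @lines, $line;
}
close $fh;

print Dumper( \@lines );        # after chomp: no newlines left
```

Because `$fh` and `$line` are declared where they are used, they cannot leak into other parts of the script the way a bareword handle such as `FILE` or a variable declared at the top of the file can.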
Re^2: Script help needed - RegEx & Hashes by hippo (Archbishop) on Oct 10, 2018 at 11:14 UTC
All good points (++). One minor comment:

> Please try to use consistent indentation.

From my quick eye-parse of the code, PandaRaey is using consistent indentation. It's just that their choice of indentation scheme (Whitesmiths) is unusual to see in Perl.
Re^3: Script help needed - RegEx & Hashes by haukex (Archbishop) on Oct 10, 2018 at 11:34 UTC
Yes, you're right! (When I skimmed the code, the final `close`s jumped out at me as looking off; I'm obviously not used to this style.) Sorry, PandaRaey, this is Perl and There Is More Than One Way To Do It: you're free to use whatever indentation style you like, as long as it's consistent (which it is in this case). If I may make a minor comment though: some more whitespace in statements like `foreach$folder(@folders)` would make them a little easier to read, IMHO.
Re: Script help needed - RegEx & Hashes by hippo (Archbishop) on Oct 10, 2018 at 10:58 UTC
`open(MERGE,">merge"); #open new file to save the new sortet stuff in`

You are doing this inside the foreach loop, so each time round the loop the contents of this file get clobbered (i.e. erased) by the new call to open. Use append mode instead:

`open (MERGE, '>>merge');`

Alternatively, you could open the file once before starting the foreach loop and close it once after the end. There are pros and cons to both approaches, neither of which should probably concern you today. Let us know if this solves it for you.
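The difference between the two modes can be seen in a small, self-contained experiment. This is only a sketch under assumptions: the scratch directory, the items `a`, `b`, `c` and the helper `slurp` are hypothetical and not part of the original script.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir( CLEANUP => 1 );    # scratch directory for the demo
my $file = "$dir/merge";

# Read a whole file into one string (hypothetical helper for this demo).
sub slurp {
    my ($path) = @_;
    open my $in, '<', $path or die "$path: $!";
    local $/;                          # undef the record separator: read everything
    my $content = <$in>;
    close $in;
    return $content;
}

# '>' inside the loop: every open truncates the file, only the last item survives.
for my $item (qw(a b c)) {
    open my $out, '>', $file or die "$file: $!";
    print {$out} "$item\n";
    close $out or die "$file: $!";
}
my $clobbered = slurp($file);

unlink $file or die "$file: $!";       # start the second run without a file

# '>>' inside the loop: every open appends, all items survive.
for my $item (qw(a b c)) {
    open my $out, '>>', $file or die "$file: $!";
    print {$out} "$item\n";
    close $out or die "$file: $!";
}
my $appended = slurp($file);

print "with '>':\n$clobbered", "with '>>':\n$appended";
```

One thing to keep in mind with append mode is that re-running the script adds to whatever is already in the file, which is why opening once before the loop with `>` is the other common choice.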
Re: Skript help needed - RegEx & Hashes by PandaRaey (Novice) on Oct 10, 2018 at 12:56 UTC
Thank you all very much for the super fast replies. I will work my way through the tips and tricks and post an update as soon as I can. Since it was asked for, here is a more detailed description of my problem: I have several folders which all contain two specific files, one whose name ends in ".mapped_sequences" and one that always has the same name, "unitas.tRF-table.txt". The mapped_sequences file looks like this, always a number followed by a gene sequence:

    >1
    CCTCCTCTACCTCATCCCAGTT
    >1
    GGGTTCGATTCCCGGTCAGGGAT

The other file looks like this (without the four header lines, and with just a few example lines, as the whole file is a bit big):

    source_tRNA	5p-tR-halves (fractionated)	5p-tR-halves (absolute)	5p-tRFs (fractionated)	5p-tRFs (absolute)	3p-tR-halves (fractionated)	3p-tR-halves (absolute)	3p-CCA-tRFs (fractionated)	3p-CCA-tRFs (absolute)	3p-tRFs (fractionated)	3p-tRFs (absolute)	tRF-1 (fractionated)	tRF-1 (absolute)	tRNA-leader (fractionated)	tRNA-leader (absolute)	misc-tRFs (fractionated)	misc-tRFs (absolute)
    MT-TL2	0	0	0	0	0	0	0	0	0	0	0	0	6.16666666666667	18	0	0
    MT-TL2-ENSG00000210191.1	1	1	4	4	0	0	0	0	0	0	0	0	0	0	124	124
    MT-TM	0	0	0	0	0	0	0	0	0	0	6	6	0	0	0	0
    MT-TM-ENSG00000210112.1	13	13	9	9	0	0	0	0	0	0	0	0	0	0	40.8333333333333	43
    MT-TN	0	0	0	0	0	0	0	0	0	0	1.5	3	2	2	0	0
    MT-TN-ENSG00000210135.1	0	0	1	1	0	0	0	0	0	0	0	0	0	0	25.25	26
    MT-TP	0	0	0	0	0	0	0	0	0	0	2	2	0	0	0	0
    tRNA-Ala-AGC-1-1	0	0	0.142857142857143	1	0	0	0	0	0	0	0	0	0	0	1.21693121693122	10
    tRNA-Ala-AGC-11-1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	9.99444444444444	39
    tRNA-Ala-AGC-15-1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	4.26111111111111	21
    tRNA-Ala-AGC-2-1	0	0	0.166666666666667	2	0	0	0	0	0.0909090909090909	1	0	0	0	0	1.53835978835979	12
    tRNA-Ala-AGC-2-2	0	0	0.166666666666667	2	0	0	0	0	0.0909090909090909	1	0	0	0	0	1.53835978835979	12
    tRNA-Ala-AGC-3-1	0	0	0.166666666666667	2	0	0	0	0	0.0909090909090909	1	0	0	0	0	1.21693121693122	10
    tRNA-Ala-AGC-4-1	0	0	5.75	46	0	0	0	0	0	0	0	0	0	0	1.17407407407407	13
    tRNA-Ala-AGC-5-1	0	0	0.166666666666667	2	0	0	0	0	0	0	0	0	0	0	1.21693121693122	10
    tRNA-Ala-AGC-6-1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2	2
    tRNA-Ala-AGC-7-1	0	0	0.166666666666667	2	0	0	0	0	0	0	0	0	0	0	1.53835978835979	12
    tRNA-Ala-AGC-8-1	0	0	0.5	1	0	0	0	0	0	0	0	0	0	0	9.99444444444444	39
    tRNA-Ala-AGC-8-2	0	0	0.5	1	0	0	0	0	0	0	0	0	0	0	9.99444444444444	39
    tRNA-Ala-AGC-9-1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0.511111111111111	3
    tRNA-Ala-AGC-9-2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0.511111111111111	3
    tRNA-Ala-CGC-1-1	0	0	5.75	46	0	0	0	0	0	0	0	0	0	0	5.84074074074074	21
    tRNA-Ala-CGC-2-1	0	0	5.75	46	0	0	0	0	0	0	19	19	1	1	4.75740740740741	21
    tRNA-Ala-CGC-3-1	0	0	5.75	46	0	0	0	0	0	0	10	10	0	0	6.07407407407407	8
    tRNA-Ala-CGC-4-1	0	0	0.166666666666667	2	0	0	0	0	0	0	0	0	0	0	1.28835978835979	11
    tRNA-Ala-TGC-1-1	0	0	0.166666666666667	2	0	0	0	0	0.0909090909090909	1	0	0	0	0	5.12645502645503	24
    tRNA-Ala-TGC-2-1	0	0	5.75	46	0	0	0	0	0	0	0	0	0	0	5.12645502645503	24
    tRNA-Ala-TGC-3-1	0	0	5.75	46	0	0	0	0	0	0	0	0	0	0	29.2931216931217	74
    tRNA-Ala-TGC-3-2	0	0	5.75	46	0	0	0	0	0	0	0	0	0	0	29.2931216931217	74
    tRNA-Ala-TGC-4-1	0	0	5.75	46	0	0	0	0	0	0	0	0	0	0	95.7097883597884	113
    tRNA-Ala-TGC-5-1	0	0	0.166666666666667	2	0	0	0	0	0	0	0	0	0	0	2.20978835978836	17
    tRNA-Ala-TGC-6-1	0	0	0.166666666666667	2	0	0	0	0	0.0909090909090909	1	0	0	0	0	0.0740740740740741	2
    tRNA-Ala-TGC-7-1	0	0	0.166666666666667	2	0	0	0	0	0	0	0	0	0	0	2.20978835978836	17
    tRNA-Arg-ACG-1-1	0	0	0.2	2	0	0	0.142857142857143	1	0	0	13	13	0	0	9.83333333333333	95

So the first task was to add up all of the numbers from the first file (the reads), which is the one thing I got to work, and it does that very well for all the files. The next task is to recalculate the numbers in the second file (number/reads*1000000) and afterwards sum those numbers up.

As you can see from the second example, there are multiple lines for the same amino-acid combination. All lines for one combination should be summed up and saved in a new, more organized file (the merged file, containing only the columns with the fractionated parts). I hope that explains what this script should do.

Regarding the indentation style: what would be a common one? I have to admit I only know this one. I got a book from my professor to find my way into Perl, and that was the style they used there, so I kind of stuck to it. Once again, thank you all for the super quick replies. I am very glad I found so much help so quickly. ~Panda
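The two calculations described above, summing the reads and scaling a table value by `number/reads*1000000`, can be sketched in isolation. The helper names `count_reads` and `per_million` are hypothetical, and the sketch assumes that each read count sits on its own line starting with `>`, with the sequence on the following line.

```perl
use strict;
use warnings;

# Sum the read counts from header lines such as ">1"; sequence lines are ignored.
sub count_reads {
    my (@lines) = @_;
    my $reads = 0;
    for my $line (@lines) {
        $reads += $1 if $line =~ /^>(\d+)\s*$/;
    }
    return $reads;
}

# Scale one table value to "per million reads": value / reads * 1000000.
sub per_million {
    my ( $value, $reads ) = @_;
    die "no reads counted\n" unless $reads;    # avoid dividing by zero
    return $value / $reads * 1_000_000;
}

# The four sample lines from the mapped_sequences example.
my @mapped = ( ">1", "CCTCCTCTACCTCATCCCAGTT", ">1", "GGGTTCGATTCCCGGTCAGGGAT" );

my $reads = count_reads(@mapped);
print "reads: $reads\n";
print "scaled: ", per_million( 13, $reads ), "\n";    # 13 is the first value of the MT-TM-ENSG... row
```

Keeping the scaling in one small function means the formula appears once, so a later change (for example a different scaling factor) only has to be made in one place.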
Re: Skript help needed - RegEx & Hashes by jwkrahn (Abbot) on Oct 10, 2018 at 18:17 UTC
    @folders=glob("*"); #to get all folders in directory; extension ("*") as wildcard to get all names
    foreach$folder(@folders) #to speak to each element in directory
        {
        next if ($folder!~/^UNITAS_/); #skip elements which do not start with "UNITAS"

Because you are only interested in "folders" that begin with the string "UNITAS_" you can do that with `glob`:

    # to get UNITAS_* folders in directory
    my @folders = glob "UNITAS_*";
    # to speak to each element in directory
    foreach my $folder ( @folders ) {

Next:

    $head=<TRF>; #remove the first four lines of the trf-table.txt file
    $head=<TRF>;
    $head=<TRF>;
    $head=<TRF>;

You don't need a variable to do that:

    undef = <TRF>; #remove the first four lines of the trf-table.txt file
    undef = <TRF>;
    undef = <TRF>;
    undef = <TRF>;

Or use a loop:

    # remove the first four lines of the trf-table.txt file
    undef = <TRF> for 1 .. 4;
Re^2: Skript help needed - RegEx & Hashes by AnomalousMonk (Archbishop) on Oct 11, 2018 at 00:14 UTC
... or just `<TRF> for 1 .. 4;` (n readline-s, no assignment needed). Give a man a fish: `<%-{-{-{-<`
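A note on the variants above: as far as I can tell, `undef = <TRF>;` is rejected by perl at compile time, because `undef` cannot be the target of a scalar assignment, and the list form `(undef) = <TRF>;` would read the handle in list context and consume every remaining line. The plain `<$fh> for 1 .. 4;` form avoids both problems. The sketch below compares it with a loop that skips by line number using `$.`; the in-memory "file" and its contents are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for unitas.tRF-table.txt: four header lines, then data.
my $data = join '', map { "header line $_\n" } 1 .. 4;
$data .= "MT-TM\t0\t6\n" . "MT-TP\t0\t2\n";

# Variant 1: read and discard four lines before the main loop.
open my $fh, '<', \$data or die "in-memory open failed: $!";
<$fh> for 1 .. 4;
my @names_v1;
while ( my $line = <$fh> ) {
    chomp $line;
    push @names_v1, ( split /\t/, $line )[0];    # keep the first column only
}
close $fh;

# Variant 2: let the loop skip the headers, using the input line counter $.
open my $fh2, '<', \$data or die "in-memory open failed: $!";
my @names_v2;
while ( my $line = <$fh2> ) {
    next if $. < 5;                              # lines 1..4 are headers
    chomp $line;
    push @names_v2, ( split /\t/, $line )[0];
}
close $fh2;

print "variant 1: @names_v1\n";
print "variant 2: @names_v2\n";
```

Both variants should collect the same names. The first keeps the main loop free of header handling; the second keeps everything in one loop at the cost of a comparison per line.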
Re: Skript help needed - RegEx & Hashes by poj (Abbot) on Oct 10, 2018 at 20:00 UTC
The error is on line 69:

    while($line=<TRF>)
        {
        # @line=split("\t",$trftable); # error
        @line=split("\t",$line);

Also, you probably want the merge file in the same folder as the other two files:

`open MERGE,">","$folder/merge" or die "Could not open $folder/merge : $!";`

poj
Re: Skript help needed - RegEx & Hashes by PandaRaey (Novice) on Oct 11, 2018 at 17:42 UTC
Thank you all so, so much for your feedback and the help, you are my heroes right now. The errors pointed out by poj, hippo and haukex solved the issue: when I made only those quick changes in my old script (the one variable change ~~still can't believe I overlooked that I used the wrong variable all along~~ and the append mode), it worked wonders. I get what I want and I could cry tears of happiness right now. You guys cannot imagine the relief I am feeling.

However, I no longer want just a working script but one that also looks good. For that I would love to implement all the changes that have been suggested, but I am running into several problems while doing so.

1) When I try to declare the variables where I need them, for some reason I still get the following error: Global symbol "VARIABLE" requires explicit package name (did you forget to declare "my VARIABLE"?)

2) While trying to use the newer suggested way of opening files, I get the following error: Scalar found where operator expected at new_reads.pl line 24, near """$folder" (Missing operator before $folder?)

I don't really understand why these errors are occurring, but I would love to get them fixed, simply because I don't want my script to be an "outdated" thing that uses old ways of handling files and variables lol. I copied the "newer" version of the script below and, once again, thank all of you for your help.
    #!/usr/bin/perl -w
    use strict;
    use warnings;

    #Initiate all variables, hashes and co

    #Open folders in working directory
    my @folders = glob("*"); #to get all folders in directory; extension ("*") as wildcard to get all names
    foreach my $folder(@folders) #to speak to each element in directory
        {
        next if ($folder!~/^UNITAS_/); #skip elements which do not start with "UNITAS"
        opendir(DIR,$folder)||die print$!; #open folder, end script when opening is not possible (DIR is the "filehandle" for the directory)
        print"\n$folder";

        while( my $file=readdir(DIR)) #returns content of folder
            {
            next if($file!~/\.mapped_sequences$/); #get the mapped_sequences file we need to read out the reads
            print"\n$file"; #print out file names to make sure we get the right files

            my $reads = 0; #set the number of reads to 0 for each run

            open my $fileone, '<', "$folder/$file" or die ""$folder/$file": $!";
            while(my $tocount=<$fileone>)#read file
                {
                chomp $tocount;
                $tocount =~ s/>//g; #remove all ">"
                next if ($tocount =~ /[A-Za-z]/); #skip lines which contain the sequence
                if ($tocount =~ /[0-9]/) #get the read-number
                    {
                    print"\n$tocount";
                    $reads = ($reads + $tocount); # add up all reads
                    }
                print"\n$reads";
                }
            close $fileone;

            my $trftable = 'unitas.tRF-table.txt'; #save file name in variable
            open my $trf, '<', "$folder/$trftable" or die ""$folder/$trftable": $!";

            undef = <$trf> for 1 .. 4;

            my %hash = (); #initiate empty hash
            while( my $line=<$trf>)
                {
                chomp $line;
                my @line=split("\t",$line);
                if($line[0]=~s/tRNA-[^-]+-...//) # "tRNA-" (matches tRNA and -) "[^-]+" (beginning to end, no matter what) "-..." (another dash to the end)
                    {
                    my $tRNAname=$line[0];
                    $tRNAname=$&; # "$&" = last pattern match
                    print"\n$tRNAname";
                    }
                else
                    {
                    my $tRNAname=$line[0];
                    $tRNAname=~s/-ENS.+$//; # "-ENS.+$" (matches everything from -ENS. to the end)
                    print"\n$tRNAname";
                    }
                my $hash{$tRNAname}{"5p-tR-halves"}+=$line[1]/$reads*1000000;
                $hash{$tRNAname}{"5p-tRFs"}+=$line[3]/$reads*1000000;
                $hash{$tRNAname}{"3p-tR-halves"}+=$line[5]/$reads*1000000;
                $hash{$tRNAname}{"3p-CCA-tRFs"}+=$line[7]/$reads*1000000;
                $hash{$tRNAname}{"3p-tRFs"}+=$line[9]/$reads*1000000;
                $hash{$tRNAname}{"tRF-1"}+=$line[11]/$reads*1000000;
                $hash{$tRNAname}{"tRNA-leader"}+=$line[13]/$reads*1000000;
                $hash{$tRNAname}{"misc-tRFs"}+=$line[15]/$reads*1000000;
                }

            open my $merge,">>","$folder/$merge" or die "Could not open $folder/$merge : $!";
            my @tRF_types=("5p-tR-halves","5p-tRFs","3p-tR-halves","3p-CCA-tRFs","3p-tRFs","tRF-1","tRNA-leader","misc-tRFs");
            foreach $tRNAname(sort{$a cmp $b}keys%hash) #sorts alphabetically by keys
                {
                print MERGE $tRNAname; # print tRNA name
                foreach my $tRF_type(@tRF_types)
                    {
                    print MERGE"\t$hash{$tRNAname}{$tRF_type}"; # print counts for each tRF type separated by tab
                    }
                print MERGE"\n";# print newline
                }
            close TRF;
            close MERGE;
            close DIR;
            }
        }
Re^2: Skript help needed - RegEx & Hashes by poj (Abbot) on Oct 11, 2018 at 18:11 UTC
A few typos to correct:

line 24 - remove double "

    #open my $fileone, '<', "$folder/$file" or die ""$folder/$file": $!";
    open my $fileone, '<', "$folder/$file" or die "$folder/$file: $!";

line 44 - same

    #open my $trf, '<', "$folder/$trftable" or die ""$folder/$trftable": $!";
    open my $trf, '<', "$folder/$trftable" or die "$folder/$trftable: $!";

line 46 - remove undef

    #undef = <$trf> for 1 .. 4;
    <$trf> for 1 .. 4;

line 49 - add declare here to expand scope of variable

    my $tRNAname;

line 57 - remove my

    #my $tRNAname=$line[0];
    $tRNAname=$line[0];

line 63 - remove my

    #my $tRNAname=$line[0];
    $tRNAname=$line[0];

line 68 - remove my (as %hash declared earlier)

    #my $hash{$tRNAname}{"5p-tR-halves"}+=$line[1]/$reads*1000000;
    $hash{$tRNAname}{"5p-tR-halves"}+=$line[1]/$reads*1000000;

line 69 - change $merge after $folder to merge

    #open my $merge,">>","$folder/$merge" or die "Could not open $folder/$merge : $!";
    open my $merge,">>","$folder/merge" or die "Could not open $folder/merge : $!";

line 84..94 - change MERGE to $merge

    print MERGE $tRNAname; # print tRNA name
    foreach my $tRF_type(@tRF_types)
        {
        print MERGE"\t$hash{$tRNAname}{$tRF_type}"; # print counts for each tRF type separated by tab
        }
    print MERGE"\n";# print newline
    }
    close TRF;
    close MERGE;
    close DIR;

poj
Re^3: Script help needed - RegEx & Hashes by hippo (Archbishop) on Oct 11, 2018 at 20:44 UTC
    line 24 - remove double "
    #open my $fileone, '<', "$folder/$file" or die ""$folder/$file": $!";

In case it isn't obvious to anyone reading this, the problem with `die ""$folder/$file": $!";` is that there's nothing to say that the second " is a literal character and not simply closing the empty string started by the first " character. Here are some alternatives:

    die "'$folder/$file': $!";    # choose a different literal character
    die "\"$folder/$file\": $!";  # escape the inner "
    die qq{"$folder/$file": $!};  # use an alternative outer delimiter
Re^2: Skript help needed - RegEx & Hashes by haukex (Archbishop) on Oct 12, 2018 at 11:20 UTC
Here's how I might have written that script, with the following changes to your version:

- Formatting: I've used my personal formatting style; a matter of taste of course (and sometimes I even vary my own style, if I think it looks better another way). For example, I added a bunch of whitespace and removed a couple of parens where they're not strictly necessary (but you are free to add parens if you like). I wrapped a few long lines so they would display nicely here, but usually I write my `open ... or die ...` on one line if it fits reasonably.
- You don't need both `#!/usr/bin/perl -w` and `use warnings;` (see What's wrong with -w and `$^W`).
- I used the glob suggestion from jwkrahn, and also made sure that it only returns directories, using the -d filetest operator. Note that glob has quite a few caveats, but with fixed strings it's ok.
- I applied most of the changes suggested by poj and others.
- Just to shorten the code a bit, I used an intermediate hash reference `$h` for `$hash{$tRNAname}`. For this to work I had to make sure to initialize `$hash{$tRNAname}` with an empty anonymous hash: `my $h = ( $hash{$tRNAname} //= {} )` means "assign `{}` (an empty anonymous hash) to `$hash{$tRNAname}` if the latter is not yet defined, then assign the value of `$hash{$tRNAname}` to `$h`". (See also perlreftut and perlref.) Update: And see hippo's reply for one way to shorten it even more.
- You were closing the directory handle too early, and I had to change the scoping of a couple of variables like `$tRNAname`.
- I switched to using Data::Dumper to output the variables, configured so that I like the output better (although normally I'd use Data::Dump; Data::Dumper is a core module). BTW, I'm not sure why you were prefixing the `\n` in your `print`s, but normally one would do things like `print "$tocount\n";`
- You said `open my $merge,">>","$folder/$merge"`, but the latter variable doesn't yet exist at that point (`my $merge` doesn't take effect until after the `open` statement), and in your original script you said `open(MERGE,">merge")`, so I'm not sure if you want a `merge` file per folder, or a single `merge` file in the current working directory? If it's the latter, then hippo's suggestion of opening the file once at the top of the script is probably better; you also don't have to use append mode then.
- I'm not sure about `if ( $tocount =~ /[0-9]/ )`: If you want to make sure that it contains only digits, you should anchor your regex, as in ~~`/^[0-9]$/`~~ `/^[0-9]+$/`.
- Plus I made a few other tweaks and used idioms in a few places, such as `( $tRNAname = $line[0] ) =~ s/-ENS.+$//`, which means "copy `$line[0]` to `$tRNAname` and then apply the regex to `$tRNAname`".
- `{$a cmp $b}` is the default sort order and isn't really needed, unless you really want to be explicit (it doesn't hurt).

Please have a look, and if you have any questions, please let us know.
    #!/usr/bin/env perl
    use warnings;
    use strict;
    use Data::Dumper;
    $Data::Dumper::Useqq = 1;
    $Data::Dumper::Quotekeys = 0;
    $Data::Dumper::Sortkeys = 1;

    for my $folder ( grep {-d} glob('UNITAS_*') ) {
        print Data::Dumper->Dump([$folder], [qw/folder/]);
        opendir my $dh, $folder or die "$folder: $!";
        while ( my $file = readdir($dh) ) {
            next if $file !~ /\.mapped_sequences$/;
            print Data::Dumper->Dump([$file], [qw/file/]);

            my $reads = 0;
            open my $fileone, '<', "$folder/$file"
                or die "$folder/$file: $!";
            while ( my $tocount = <$fileone> ) {
                chomp $tocount;
                $tocount =~ s/>//g;
                next if $tocount =~ /[A-Za-z]/;
                if ( $tocount =~ /[0-9]/ ) {
                    print Data::Dumper->Dump([$tocount], [qw/tocount/]);
                    $reads += $tocount;
                }
            }
            close $fileone;
            print Data::Dumper->Dump([$reads], [qw/reads/]);

            my %hash;
            my $trftable = 'unitas.tRF-table.txt';
            open my $trf, '<', "$folder/$trftable"
                or die "$folder/$trftable: $!";
            <$trf> for 1 .. 4;
            while ( my $line = <$trf> ) {
                chomp $line;
                my @line = split /\t/, $line;
                #print Data::Dumper->Dump([\@line], [qw/line/]);
                my $tRNAname;
                if ( $line[0] =~ s/tRNA-[^-]+-...// )
                    { $tRNAname = $& }
                else
                    { ( $tRNAname = $line[0] ) =~ s/-ENS.+$// }
                print Data::Dumper->Dump([$tRNAname], [qw/tRNAname/]);
                my $h = ( $hash{$tRNAname} //= {} );
                $h->{"5p-tR-halves"} += $line[ 1] / $reads * 1000000;
                $h->{"5p-tRFs"}      += $line[ 3] / $reads * 1000000;
                $h->{"3p-tR-halves"} += $line[ 5] / $reads * 1000000;
                $h->{"3p-CCA-tRFs"}  += $line[ 7] / $reads * 1000000;
                $h->{"3p-tRFs"}      += $line[ 9] / $reads * 1000000;
                $h->{"tRF-1"}        += $line[11] / $reads * 1000000;
                $h->{"tRNA-leader"}  += $line[13] / $reads * 1000000;
                $h->{"misc-tRFs"}    += $line[15] / $reads * 1000000;
            }
            close $trf;
            print Data::Dumper->Dump([\%hash], [qw/hash/]);

            open my $merge, '>>', "$folder/merge"
                or die "$folder/merge: $!";
            my @tRF_types = ("5p-tR-halves", "5p-tRFs", "3p-tR-halves",
                "3p-CCA-tRFs", "3p-tRFs", "tRF-1", "tRNA-leader", "misc-tRFs");
            for my $tRNAname ( sort keys %hash ) {
                print $merge $tRNAname;
                for my $tRF_type (@tRF_types) {
                    print $merge "\t$hash{$tRNAname}{$tRF_type}";
                }
                print $merge "\n";
            }
            close $merge;
        }
        close $dh;
    }

For the sample data from this post, the output file `merge` I get is the following. Note that if you re-run the script, because of the append mode on the `merge` file, the same lines get added to that file again.

    MT-TM	6500000	4500000	0	0	0	0	0	20416666.6666666
    MT-TN	0	500000	0	0	0	750000	1000000	12625000
    MT-TP	0	0	0	0	0	1000000	0	0
    tRNA-Ala-AGC	0	3863095.23809524	0	0	136363.636363636	0	0	23353306.8783069
    tRNA-Ala-CGC	0	8708333.33333333	0	0	0	14500000	500000	8980291.00529101
    tRNA-Ala-TGC	0	11833333.3333333	0	0	90909.0909090909	0	0	84521296.2962963
    tRNA-Arg-ACG	0	100000	0	71428.5714285715	0	6500000	0	4916666.66666667

Update: Minor edits and a few additions to the explanations.
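Two of the points from this reply, anchoring the digit check and deriving the group name from the first column, can be tried in isolation. This is a sketch with hypothetical helper names (`has_digit`, `all_digits`, `group_name`); unlike the script above it uses a capture group and a leading `^` anchor instead of `s///` plus `$&`, which is a variation of mine and not what was posted.

```perl
use strict;
use warnings;

# Unanchored: true if there is a digit anywhere in the string.
sub has_digit { return $_[0] =~ /[0-9]/ ? 1 : 0 }

# Anchored: true only if the whole string consists of digits.
sub all_digits { return $_[0] =~ /^[0-9]+$/ ? 1 : 0 }

# Collapse "tRNA-Ala-AGC-1-1" to "tRNA-Ala-AGC" and
# "MT-TM-ENSG00000210112.1" to "MT-TM"; other names pass through unchanged.
sub group_name {
    my ($source) = @_;
    return $1 if $source =~ /^(tRNA-[^-]+-...)/;
    ( my $name = $source ) =~ s/-ENS.+$//;    # copy first, then edit the copy
    return $name;
}

for my $s ( '12', 'x1y', 'ACGT' ) {
    printf "%-5s has_digit=%d all_digits=%d\n", $s, has_digit($s), all_digits($s);
}
for my $n ( 'tRNA-Ala-AGC-1-1', 'MT-TM-ENSG00000210112.1', 'MT-TP' ) {
    print "$n => ", group_name($n), "\n";
}
```

A capture group leaves `$line[0]` untouched, which can make debugging output easier to follow. Note also that `$` in `/^[0-9]+$/` still matches before a trailing newline, so the check relies on the line having been chomped (or on using `\z` instead).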
Re^3: Script help needed - RegEx & Hashes (DRY) by hippo (Archbishop) on Oct 12, 2018 at 12:27 UTC
    $h->{"5p-tR-halves"} += $line[ 1] / $reads * 1000000;
    $h->{"5p-tRFs"}      += $line[ 3] / $reads * 1000000;
    $h->{"3p-tR-halves"} += $line[ 5] / $reads * 1000000;
    $h->{"3p-CCA-tRFs"}  += $line[ 7] / $reads * 1000000;
    $h->{"3p-tRFs"}      += $line[ 9] / $reads * 1000000;
    $h->{"tRF-1"}        += $line[11] / $reads * 1000000;
    $h->{"tRNA-leader"}  += $line[13] / $reads * 1000000;
    $h->{"misc-tRFs"}    += $line[15] / $reads * 1000000;

Years ago that would not have bothered me but nowadays it makes me twitch. YMMV but for DRY:

    my $i = 1;
    for (qw/5p-tR-halves 5p-tRFs 3p-tR-halves 3p-CCA-tRFs 3p-tRFs tRF-1 tRNA-leader misc-tRFs/) {
        $h->{$_} += $line[$i] / $reads * 1000000;
        $i += 2; # Odd entries only
    }
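The loop above can also be written so that the column index is computed from the position in the list of types, which removes the separate counter. The following is a sketch only: `add_line` is a hypothetical helper, and the row and the read count of 2 are taken from the sample data earlier in the thread.

```perl
use strict;
use warnings;

my @tRF_types = qw/5p-tR-halves 5p-tRFs 3p-tR-halves 3p-CCA-tRFs
                   3p-tRFs tRF-1 tRNA-leader misc-tRFs/;

# Add the scaled "fractionated" columns of one table row to the totals in $h.
sub add_line {
    my ( $h, $line, $reads ) = @_;
    for my $i ( 0 .. $#tRF_types ) {
        # type number $i lives in column 2*$i+1, i.e. columns 1, 3, ..., 15
        $h->{ $tRF_types[$i] } += $line->[ 2 * $i + 1 ] / $reads * 1_000_000;
    }
    return $h;
}

my %hash;

# One row from the sample table: name, then 16 values.
my @line = ( 'MT-TM-ENSG00000210112.1', 13, 13, 9, 9, (0) x 10, 40.8333333333333, 43 );

add_line( $hash{'MT-TM'} //= {}, \@line, 2 );

print "$_\t$hash{'MT-TM'}{$_}\n" for @tRF_types;
```

Reusing the same `@tRF_types` array for both accumulating and printing keeps the column order defined in exactly one place.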
Re^4: Script help needed - RegEx & Hashes (DRY) by haukex (Archbishop) on Oct 12, 2018 at 12:30 UTC
Re^2: Skript help needed - RegEx & Hashes by jwkrahn (Abbot) on Oct 11, 2018 at 18:25 UTC
    undef = <$trf> for 1 .. 4;

    my %hash = (); #initiate empty hash
    while( my $line=<$trf>)
        {
        chomp $line;
        my @line=split("\t",$line);

Another way to skip the first four lines:

    my %hash = (); #initiate empty hash
    while ( my $line = <$trf> ) {
        next if $. < 5;  # skip first four lines
        chomp $line;
        my @line = split /\t/, $line;
Re: Skript help needed - RegEx & Hashes by PandaRaey (Novice) on Oct 15, 2018 at 16:45 UTC
Hello guys, thank you all so, so much for your replies. I learned so much from them, it's amazing haha. I also apologize for not replying earlier; I had some major issues with the internet and was only able to get back online and take a look at your help today.

@Haukex: thank you very much. The reason I have been using this formatting is mainly that I constantly forget to open or close brackets, and the way I did it gives me a better overview of the brackets I open and close. My guess is that the more I work with Perl and get used to it, the more likely I can switch to a different formatting style.

Another question I have is about this: $h->{"3p-tRFs"} . How was it possible to replace the variable call from my version with " -> " ?

My second question is about Data::Dumper. Isn't it counterproductive to use it when you are working with large data sets and would end up printing a lot of content to the terminal? Or would you just use Data::Dumper until you have made sure that your script is working? I can definitely see the advantages of Data::Dumper, but I am a bit worried it could slow things down too much with large data sets.

poj, hippo, jwkrahn, thank you all so much for your help and for bearing with my probably very basic and banal problems. I definitely learned a lot from your help, and I am slowly becoming less scared of working with Perl.
Re^2: Skript help needed - RegEx & Hashes by haukex (Archbishop) on Oct 15, 2018 at 19:59 UTC
Glad to help, that's what we're here for :-)

Like I said, don't worry too much about the formatting - it's much more important that you apply whatever formatting style you choose consistently, because no matter what the formatting, if indentation isn't applied consistently, it's much easier to make mistakes. Using one of the more common styles might make your code a little easier for others to read, but inconsistent indentation is much more problematic.

> $h->{"3p-tRFs"} . How was it possible to replace the variable call from my version with " -> " ?

The Arrow Operator is both the method call operator and the dereferencing operator. For example, I can say:

    my %hash = ( hello => "foo", world => "bar");
    my $hashref = \%hash;        # store a reference to the %hash
    print $hashref->{hello};     # prints "foo"
    $hashref->{world} = "quz";   # change "bar" to "quz" in orig. hash

References are explained quite nicely in perlreftut - if you've heard of the concept of pointers, references are kind of like "safer" pointers, and therefore less scary ;-) Two advantages of references are that (a) instead of copying data structures when they are passed as arguments to functions, you can just pass a reference instead, which saves memory and allows the function to modify the original data if desired*, and (b) you can build complex data structures out of them, for example an array can contain a list of references to hashes, then you have an AoH (array of hashes); hash values can be references to arrays (HoA, hash of arrays), and all sorts of complex data structures. You can see plenty of examples of the latter in perldsc, and references are explained in detail in perlref.

An "anonymous" hash or array is called that because it doesn't have a name. In "`my $hashref = \%hash;`", the hash being referenced by `$hashref` has a name, `%hash`. "`my $hashref = {};`" does basically the same as the previous piece of code, but now the hash referenced by `$hashref` is newly created and does not have a name (quite useful when building nested data structures).

> ... Data:Dumper. Isn't it contra-productive to use it when you are working with large data-sets and would end up printing a lot of content into the terminal?

It's just intended for debugging output - I figured that's what you wanted because you were using `print`s in your original code. It's no problem to comment them out, or probably even better to do something like:

    my $DEBUG = 0;  # at the top of the program
    ...
    $DEBUG and print Dumper(...);
    # or
    print Dumper(...) if $DEBUG;

I personally prefer the former because it's visually a bit easier to skip those lines when you're skimming the code. Or, if performance is a concern, the following will be optimized away (the disadvantage being that it's a bit trickier to change a constant via a command-line option):

    use constant DEBUG => 0;
    ...
    DEBUG and print Dumper(...);

Minor edits for clarity.

* Update 2: This statement applies when using the common ways to access arguments: `sub abc { my ($foo,$bar,...) = @_ }` and `sub abc { my $foo=shift; my $bar=shift; ... }`. Don't worry about accessing the elements of `@_` directly just yet (`$_[0]`, `$_[1]`, ...), that's a topic for another day.
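The points about named versus anonymous hashes, passing references to functions, nested structures and a compile-time debug switch can be combined in one short sketch. The names `shout`, `%named`, `$anon` and `%hoa` are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;
use Data::Dumper;
use constant DEBUG => 0;    # set to 1 to see the dumps

$Data::Dumper::Sortkeys = 1;

my %named        = ( hello => "foo" );    # a hash with a name
my $ref_to_named = \%named;               # reference to the named hash
my $anon         = { hello => "foo" };    # reference to an anonymous hash

# The function receives a reference, so it changes the caller's data.
sub shout {
    my ($h) = @_;
    $h->{hello} = uc $h->{hello};
    return;
}

shout($ref_to_named);    # %named itself is modified through the reference
shout($anon);

# Hash of arrays: the array reference springs into existence on first push.
my %hoa;
push @{ $hoa{'tRNA-Ala-AGC'} }, 'tRNA-Ala-AGC-1-1', 'tRNA-Ala-AGC-2-1';

DEBUG and print Dumper( \%named, $anon, \%hoa );

print "$named{hello} $anon->{hello} ", scalar @{ $hoa{'tRNA-Ala-AGC'} }, "\n";
```

With `DEBUG` defined as a constant with a false value, the `DEBUG and print ...` line costs nothing at run time, so it can stay in the script while working through large data sets.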
Re^2: Skript help needed - RegEx & Hashes by Anonymous Monk on Oct 16, 2018 at 00:08 UTC
> I constantly forget to open or close brackets

You can get a wonderful Perl editor free from Activestate, which will insert a closing bracket any time you type an open bracket: www.activestate.com/komodo-ide/downloads/ide
Re^3: Skript help needed - RegEx & Hashes by Anonymous Monk on Oct 16, 2018 at 00:36 UTC
A lot of non-commercial editors do that too.
Re^4: Skript help needed - RegEx & Hashes by Anonymous Monk on Oct 16, 2018 at 03:52 UTC