msnyder424 has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that looks something like the one below. I want to use it to search the current directory, open every directory inside it, open all files that match certain REs (fastq files, formatted so that every four lines go together), do some work with these files, and write some results to a file in each directory. (The actual code is much more complex, but since I think I have a structural issue I am showing a simplified version.)

#!user/local/perl
#Created by C. Pells, M. R. Snyder, and N. T. Marshall 2017
#Script trims and merges high throughput sequencing reads from fastq files for a specific primer set
use Cwd;
use warnings;
my $StartTime= localtime;
my $MasterDir = getcwd; #obtains the current directory
opendir (DIR, $MasterDir);
my @objects = readdir (DIR);
closedir (DIR);
foreach (@objects){
    print $_,"\n";
}
my @Dirs = ();
foreach my $O (0..$#objects){
    my $CurrDir = "";
    if ((length ($objects[$O]) < 7) && ($O>1)){ #Checking if the length of the object name is < 7 characters. All samples are 6 or less. removing the first two elements: "." and ".."
        $CurrDir = $MasterDir."/".$objects[$O]; #appends directory name to full path
        push (@Dirs, $CurrDir);
    }
}
foreach (@Dirs){
    print $_,"\n";#checks that all directories were read in
}
foreach my $S (0..$#Dirs){
    my @files = ();
    opendir (DIR, $Dirs[$S]) || die "cannot open $Dirs[$S]: $!";
    @files = readdir DIR; #reads in all files in a directory
    closedir DIR;
    my @AbsFiles = ();
    foreach my $F (0..$#files){
        my $AbsFileName = $Dirs[$S]."/".$files[$F]; #appends file name to full path
        push (@AbsFiles, $AbsFileName);
    }
    foreach my $AF (0..$#AbsFiles){
        if ($AbsFiles[$AF] =~ /_R2_001\.fastq$/m){ #finds reverse fastq file
            my @readbuffer=(); #read in reverse fastq
            my %RSeqHash;
            my $c = 0;
            print "Reading, reversing, complimenting, and trimming reverse fastq file $AbsFiles[$AF]\n";
            open (INPUT1, $AbsFiles[$AF]) || die "Can't open file: $!\n";
            while (<INPUT1>){
                chomp ($_);
                push(@readbuffer, $_);
                if (@readbuffer == 4) {
                    $rsn = substr($readbuffer[0], 0, 45); #trims reverse seq name
                    $cc++ % 10000 == 0 and print "$rsn\n";
                    $RSeqHash{$rsn} = $readbuffer[1];
                    @readbuffer = ();
                }
            }
        }
    }
    foreach my $AFx (0..$#AbsFiles){
        if ($AbsFiles[$AFx] =~ /_R1_001\.fastq$/m){ #finds forward fastq file
            print "Reading forward fastq file $AbsFiles[$AFx]\n";
            open (INPUT2, $AbsFiles[$AFx]) || die "Can't open file: $!\n";
            my $OutMergeName = $Dirs[$S]."/"."Merged.fasta";
            open (OUT, ">", "$OutMergeName");
            my $cc=0;
            my @readbuffer = ();
            while (<INPUT2>){
                chomp ($_);
                push(@readbuffer, $_);
                if (@readbuffer == 4) {
                    my $fsn = substr($readbuffer[0], 0, 45); #trims forward seq name
                    #$cc++ % 10000 == 0 and print "$fsn\n$readbuffer[1]\n";
                    if ( exists($RSeqHash{$fsn}) ){ #checks to see if forward seq name is present in reverse seq hash
                        print "$fsn was found in Reverse Seq Hash\n";
                        print OUT "$fsn\n$readbuffer[1]\n"; #ACUAL OUTPUT FILE IS EMPTY!!!
                    }
                    else {
                        $cc++ % 10000 == 0 and print "$fsn not found in Reverse Seq Hash\n"; #PRINTS THIS FOR EVERY LINE IN INPUT2!!!
                    }
                    @readbuffer = ();
                }
            }
            close INPUT1;
            close INPUT2;
            close OUT;
        }
    }
}

I know that the script works without iterating over folders, because a simplified version run within just one folder works, including using the REs to find file names. But with this version I just get empty output files. From the print statements I inserted in this script, I've determined that Perl can't find the value of $fsn as a key in %RSeqHash, which is built from INPUT1. I can't understand why: each file is there, and it works when I don't iterate over folders, so I know that the keys match. So either there is something simple I am missing, or I have hit some sort of limit on Perl's memory. Any help is appreciated!

Replies are listed 'Best First'.
Re: Running a script across multiple directories with multiple output files (problems comparing hash key values)
by huck (Prior) on Aug 08, 2017 at 00:33 UTC

    I may have gone in the wrong direction in my earlier reply.

    my %input1 = (); #initialize input1 hash
    for my $c (0..$#AbsFiles){
        if ($AbsFiles[$c] =~ /R2_001\.fastq$/){
            open INPUT1 ... ;
            stuff to set ... $input1{$key1} ...
            close INPUT1;
        }
    } # c
    for my $c (0..$#AbsFiles){
        if ($AbsFiles[$c] =~ /R1_001\.fastq$/){
            open INPUT2 ... ;
            ...stuff to test key2 against $input1{$key2};
            close INPUT2;
        }
    } # c
    You were resetting the input1 hash every time you opened a file to test for key1.
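
    A minimal runnable sketch of that two-pass structure, assuming four-line fastq records and the 45-character name prefix from the question. The sub names load_reads and write_matching_forward_reads are hypothetical, not from the original script, and the elided processing steps are left out.

```perl
use strict;
use warnings;

# Pass 1: read a fastq file four lines at a time and return a reference to
# a hash of name prefix => sequence. The hash is declared once per call and
# handed back to the caller, so it outlives the file loop.
sub load_reads {
    my ($path, $prefix_len) = @_;
    my %seq_for;
    my @record;
    open my $in, '<', $path or die "Can't open $path: $!";
    while (my $line = <$in>) {
        chomp $line;
        push @record, $line;
        next unless @record == 4;
        $seq_for{ substr($record[0], 0, $prefix_len) } = $record[1];
        @record = ();
    }
    close $in;
    return \%seq_for;
}

# Pass 2: write every forward read whose name prefix is a key in the
# reverse-read hash. Returns the number of reads written.
sub write_matching_forward_reads {
    my ($forward_path, $reverse_seq_for, $out_path, $prefix_len) = @_;
    my $written = 0;
    my @record;
    open my $in,  '<', $forward_path or die "Can't open $forward_path: $!";
    open my $out, '>', $out_path     or die "Can't write $out_path: $!";
    while (my $line = <$in>) {
        chomp $line;
        push @record, $line;
        next unless @record == 4;
        my $name = substr($record[0], 0, $prefix_len);
        if (exists $reverse_seq_for->{$name}) {
            print {$out} "$name\n$record[1]\n";
            $written++;
        }
        @record = ();
    }
    close $in;
    close $out or die "Can't close $out_path: $!";
    return $written;
}
```

    Inside the per-directory loop this would be called once per pair, for example load_reads on the R2 file followed by write_matching_forward_reads on the R1 file with "$Dirs[$S]/Merged.fasta" as the output path.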

Re: Running a script across multiple directories with multiple output files (problems comparing hash key values)
by huck (Prior) on Aug 08, 2017 at 00:22 UTC

    You have massive problems with variable scope. A "my" variable only "lives" within the braces that surround it. Change the top of your file from

    #!user/local/perl
    use Cwd;
    to this
    #!user/local/perl
    use strict;
    use warnings;
    use Cwd;
    And then try to understand what it is telling you. If you are still confused, read Variable Scoping in Perl: the basics. If after that you do not understand how to fix it, come back and show us your progress.
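
    A small sketch of the kind of mistake strict reports, using a hypothetical hash %seen rather than the code from the question. Strict is switched off on purpose in this demo so the silent failure can be seen.

```perl
use warnings;
no strict 'vars';    # deliberately off, to reproduce the silent failure

# Hypothetical demo: %seen and the key read_1 are illustrative only.
sub keys_seen_outside_block {
    {
        my %seen;                 # lexical: lives only inside these braces
        $seen{read_1} = 'ACGT';
    }
    # Out here %seen is a different variable: the empty package variable
    # %main::seen. Under "use strict" this line would not compile.
    return scalar keys %seen;
}

sub keys_seen_in_shared_scope {
    my %seen;                     # declared once, in the enclosing scope
    {
        $seen{read_1} = 'ACGT';   # same variable as the one read below
    }
    return scalar keys %seen;
}

print keys_seen_outside_block(), "\n";      # prints 0
print keys_seen_in_shared_scope(), "\n";    # prints 1
```

    With strict enabled the first sub fails at compile time with a "Global symbol requires explicit package name" error, which points straight at the line that uses the out-of-scope variable.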

    hint: the key will be to combine your c/d loops into a single loop. something like this

    for my $c (0..$#AbsFiles){
        my $key1=undef;
        my $key2=undef;
        if ($AbsFiles[$c] =~ /R2_001\.fastq$/){
            open INPUT1 ... ;
            ...stuff to set key1;
            close INPUT1;
        }
        if ($AbsFiles[$c] =~ /R1_001\.fastq$/){
            open INPUT2 ... ;
            ...stuff to set key2;
            close INPUT2;
        }
        if (defined($key1) && defined($key2) && $key1 eq $key2 ) {
            ... stuff to do when both are set and equal ...
        }
        else {
            ...stuff to do otherwise ..
        }
    } # c

    Note that I "escaped" the dot in the regexp. An unescaped dot will match any character, while "\." matches only a literal dot.
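
    A quick check of the difference, with made-up file names and a hypothetical helper names_matching:

```perl
use strict;
use warnings;

# Return the names that match the given compiled pattern.
sub names_matching {
    my ($pattern, @names) = @_;
    return grep { $_ =~ $pattern } @names;
}

# Illustrative names only; the second one has no dot before "fastq".
my @names = ('sample_R1_001.fastq', 'sample_R1_001xfastq');

# Unescaped dot: any single character is accepted, so both names match.
my @loose = names_matching(qr/_R1_001.fastq$/, @names);

# Escaped dot: only a literal "." is accepted, so one name matches.
my @exact = names_matching(qr/_R1_001\.fastq$/, @names);

print scalar(@loose), " ", scalar(@exact), "\n";    # prints "2 1"
```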

    Note that close does not take the less-than/greater-than signs; those are used to read from a filehandle, not to refer to it. Also note that I closed each handle inside the same scope where I opened it, which tends to be good practice.

    Edit: see my follow-up reply, Re: Running a script across multiple directories with multiple output files (problems comparing hash key values)

Re: Running a script across multiple directories with multiple output files (problems comparing hash key values)
by Anonymous Monk on Aug 08, 2017 at 00:29 UTC

    Hi,

    This is the outline of your code

    # foreach (@Dirs)
    # foreach (@Dirs)
    # for ( 0 .. $#AbsDirs )
    #     for ( 0 .. $#files )
    #     for ( 0 .. $#AbsFiles )
    #         if( $AbsFiles[$c] =~ /R2_001.fastq$/ )
    #             while(<INPUT1>)
    #                 if( @readbuffer == 4 )
    #     for ( 0 .. $#AbsFiles )
    #         if( $AbsFiles[$d] =~ /R1_001.fastq$/ )
    #             while(<INPUT2>)
    #                 if( @readbuffer == 4 )
    #                     if( exists( $input1{$key2} ) )
    #                     else

    This is how you should write code based on that outline

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use File::Find::Rule qw/ find rule /;
    use Path::Tiny qw/ path /;

    my $root = path( grep defined, shift, '.' )->realpath;

    for my $file ( find( name => qr/R2_001.fastq$/ , in => $root ) ){
        OneThing( $file );
    }
    for my $file ( find( name => qr/R1_001.fastq$/, in => $root ) ){
        TwoThing( $file );
    }
    exit 0;
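
    If File::Find::Rule is not installed, roughly the same search can be sketched with the core File::Find module. This is an alternative under that assumption, not what the code above does internally, and files_matching is a hypothetical helper name.

```perl
use strict;
use warnings;
use File::Find ();

# Return every plain file under $root whose path matches $pattern, sorted.
sub files_matching {
    my ($root, $pattern) = @_;
    my @found;
    File::Find::find(
        {
            no_chdir => 1,    # keep the working directory; $_ is the full path
            wanted   => sub {
                push @found, $File::Find::name if -f $_ && $_ =~ $pattern;
            },
        },
        $root,
    );
    my @sorted = sort @found;
    return @sorted;
}
```

    A call such as files_matching('.', qr/_R2_001\.fastq$/) would then return the reverse-read files in every subdirectory of the current directory.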

    After looking closer at your while loops, this is how you should write that

    for my $file ( find( name => qr/R2_001.fastq$/, in => $root ) ){
        SomeThing( $file );
    }
    exit 0;

    sub SomeThing {
        my( $onein) = @_;
        my $twoin = $onein;
        $twoin =~ s/R2_001.fastq$/R1_001.fastq/;
        use Path::Tiny qw/ path /;
        # realpath takes no argument; sibling names a file next to the input
        my $out = path( $onein )->realpath->sibling( 'Output.fasta' );
        OneTwo( $onein, $twoin , $out );
    }

    sub OneTwo {
        my( $onein, $twoin, $out ) = @_;
        if( not path( $twoin )->exists ){
            warn qq{Skipping "$onein" because "$twoin" does not exist};
            return;
        }
        ...
    }
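
    The path handling in SomeThing can also be sketched with core modules only, which may help if Path::Tiny is not available. The sub name paired_paths and the example path are hypothetical; "Output.fasta" follows the code above.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Spec;

# Given a reverse-read path, return the matching forward-read path and an
# output path in the same directory as the reverse-read file.
sub paired_paths {
    my ($reverse_path) = @_;
    (my $forward_path = $reverse_path) =~ s/R2_001\.fastq$/R1_001.fastq/;
    my $out_path = File::Spec->catfile( dirname($reverse_path), 'Output.fasta' );
    return ($forward_path, $out_path);
}

# Illustrative path only.
my ($forward, $out) = paired_paths('run1/S1/S1_R2_001.fastq');
print "$forward\n$out\n";
```

    Because the forward name is derived from the reverse name, each pair is processed together and there is no need to carry a hash from one file loop to another.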