Ch1ralS0ul has asked for the wisdom of the Perl Monks concerning the following question:

I am attempting to write a piece of code that reads in a couple of files, takes the first two columns from a .tsv file as the Gene ID and Gene Symbol for a key/value pair in a hash, uses the second file to read in some data that corresponds to the Gene ID, and then prints the data to a new file. I am fairly new to programming and so of course am running into an uninitialized variable brick wall. I'm not sure if this is due to an improper array split, a logic issue with my RegEx, or one of those little brain farts. Anyway, any advice would be helpful! The uninitialized variable I'm running into is the first element of the array, "$INF1Array[0]", on line 34. (Sorry if my formatting is a little unorthodox!)

    #!/usr/bin/perl
    use warnings;
    use diagnostics;

    # Title: convertDataToGeneSymbol.pl
    # Author: Nicholas Bense
    # Date: 11/4/15

    # Open a filehandle to read file #1
    open(INF1,"<",'/scratch/Drosophila/fb_synonym_fb_2014_05.tsv' ) or die $!;

    # Open a filehandle to read file #2
    open(INF2,"<",'/scratch/Drosophila/FlyRNAi_data_baseline_vs_EGF.txt') or die $!;

    # Open a filehandle to read file #3
    open(INF3,"<",'/scratch/Drosophila/gene_association.goa_fly') or die $!;

    # Open a filehandle to write new file
    open(OUTF1,">",'FLYRNAi_data_baseline_vs_EGFSymbol.txt') or die $!;

    # Initialize a hash for the gene symbol conversion
    my %geneSymbolConversion;

    # Read Input File 1 line by line
    while (<INF1>){

        # Get rid of whitespace
        chomp;

        # Split the line
        my @INF1Array = split("\t", $_);

        # Filter entries starting with FBgn
        while ($INF1Array[0] =~ /(^FBgn\d+)/){

            # Assign column 1 to hash key scalar
            my $geneID = $INF1Array[0];

            # Assign column 2 to hash value scalar
            my $geneSymbol = $INF1Array[1];

            # Assign key and value to hash
            $geneSymbolConversion{$geneID} = $geneSymbol;
        }
    }

    # Read Input File 2 line by line
    while (<INF2>){

        # Get rid of whitespace
        chomp;

        # Initialize key value in case it is not found
        my $geneSymbol = "NA";

        # Split the line on tabs
        my ($geneID, $EGF_Baseline, $EGF_Stimulus) = split("\t", $_);

        # Check if the codon is present in the hash
        if (defined $geneSymbolConversion{$geneID}){

            # Get the value associated with the codon from the hash
            $geneSymbol = $geneSymbolConversion{$geneID};
        }

        # Join data and print to output file
        print OUTF1 join( "\t", $geneID, $geneSymbol, $EGF_Baseline, $EGF_Stimulus), "\n";
    }

P.S. I will also be reading in the third input file /scratch/Drosophila/gene_association.goa_fly to load columns 3 and 5 from the gene association file into a hash- with column 3 (gene symbol) being the key and column 5 (GO term) the value- then use the hash to convert FlyRNAi_data_baseline_vs_EGFSymbol.txt to FlyRNAi_data_baseline_vs_EGF_GO.txt with the gene symbol replaced by the GO term. If you'd like to provide some tips or mention some potential pitfalls based on my apparent coding habits then please go nuts! Was going to make sure I had this portion of the program running correctly before working in that third element. Mucho gracias!

Replies are listed 'Best First'.
Re: Uninitialized Value Hash Lookup Gene Symbol
by grasshopper!!! (Beadle) on Nov 05, 2015 at 17:23 UTC

    Do you have any blank lines in the data file? A blank line leaves the array empty after the split, so $INF1Array[0] is undefined. Also, the while on line 34 causes an infinite loop whenever the pattern matches, because nothing in the loop body changes $INF1Array[0]. Change it to if.
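    The two suggestions above can be sketched roughly as follows. This is only an illustration, not the OP's script: load_symbols is a hypothetical helper, and the sample lines are made up rather than taken from the real FlyBase file.

```perl
use strict;
use warnings;

# Hypothetical helper: build an ID => symbol hash from tab-separated lines.
sub load_symbols {
    my (@lines) = @_;
    my %symbol_for;
    for my $line (@lines) {
        chomp $line;
        next unless length $line;    # skip blank lines before splitting
        my ($id, $symbol) = split /\t/, $line;
        # "if", not "while": each line is tested exactly once
        if (defined $id && defined $symbol && $id =~ /^FBgn\d+/) {
            $symbol_for{$id} = $symbol;
        }
    }
    return %symbol_for;
}

# Made-up sample input: one data line, one blank line, one comment line
my %symbols = load_symbols("FBgn0000001\tgeneA\n", "\n", "## comment line\n");
print join(",", sort keys %symbols), "\n";
```

    Skipping empty lines before the split means the first field is always defined by the time the regex runs, which is what silences the warning.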

Re: Uninitialized Value Hash Lookup Gene Symbol
by choroba (Cardinal) on Nov 05, 2015 at 15:55 UTC
    Crossposted to StackOverflow. It's considered polite to inform about crossposting, so that people not attending both sites don't waste their efforts hacking a problem already solved at the other end of the internet.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Sorry about that! Thanks for the tip and I will make sure to remember that from here on out!

Re: Uninitialized Value Hash Lookup Gene Symbol
by graff (Chancellor) on Nov 05, 2015 at 22:23 UTC
    As mentioned in the previous reply, if you're getting a bunch of "uninitialized variable" warnings, it's probably because the second file you're reading from doesn't have the kind of content that you're expecting on some (or any?) lines.

    By the way, understand that "uninitialized" is a warning, not an error. Because you have use warnings; in your script (which is good), you are learning about things that are simply not going as expected; the script continues to run, and empty strings are being printed where you might be expecting non-empty strings. Given that this is the case, try to figure out where your expectations are not being met.

    Here's a modified version of the OP script, reformatted to be more compact (and "more perlish") - note how I'm adding some lines to check for unexpected conditions, and report them:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use diagnostics;

    my %geneSymbolConversion;
    my $input1 = '/scratch/Drosophila/fb_synonym_fb_2014_05.tsv';
    open(INF1,"<", $input1 ) or die "$input1: $!\n";
    while (<INF1>){
        chomp;
        if ( /^FBgn\d+/ ) {
            my @fields = split "\t";
            $geneSymbolConversion{ $fields[0] } = $fields[1];
        }
    }
    warn sprintf( "loaded %d gene symbols from %s\n",
                  scalar keys %geneSymbolConversion, $input1 );

    my $input2 = '/scratch/Drosophila/FlyRNAi_data_baseline_vs_EGF.txt';
    open(INF2,"<", $input2) or die "$input2: $!\n";
    open(OUTF1,">",'FLYRNAi_data_baseline_vs_EGFSymbol.txt') or die $!;
    while (<INF2>) {
        chomp;
        my ($geneID, $EGF_Base, $EGF_Stimulus) = split "\t";
        if ( $geneID and $EGF_Base and $EGF_Stimulus ) {
            my $geneSymbol = $geneSymbolConversion{$geneID} || 'NA';
            print OUTF1 join("\t", $geneID, $geneSymbol, $EGF_Base, $EGF_Stimulus), "\n";
        }
        else {
            warn "$input2: line $.: incomplete data: $_\n";
        }
    }

    Also, I added "use strict", and re-arranged things so that declarations of variables are placed closer to where the variables are actually used.

    (UPDATE: I added a bit more to report how many gene symbols were loaded from the first input, just in case that's informative.)
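
    One caveat with the plain truth test on the three fields in the script above: Perl treats the string "0" as false, so a row whose baseline or stimulus value is exactly 0 would be reported as incomplete. Below is a minimal sketch of a stricter check, assuming such zero values can legitimately occur in the data. is_complete is a hypothetical helper and the sample row is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if exactly three fields are present and
# each one is defined and non-empty. Unlike a plain truth test, it accepts "0".
sub is_complete {
    my @fields = @_;
    return 0 unless @fields == 3;
    for my $field (@fields) {
        return 0 unless defined $field && length $field;
    }
    return 1;
}

# Made-up sample row with a zero in the second column
my @row = split /\t/, "FBgn0000001\t0\t1.5";
print is_complete(@row) ? "complete\n" : "incomplete\n";
```

    If zeros never appear in the real file, the simpler truth test is sufficient.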

Re: Uninitialized Value Hash Lookup Gene Symbol
by GotToBTru (Prior) on Nov 05, 2015 at 16:55 UTC

    We really need to see some sample data.

    Dum Spiro Spero