rambosauce has asked for the wisdom of the Perl Monks concerning the following question:

Hello and thanks in advance

I'm a perl novice and have some trouble with the following script being inefficient. I have a data input file containing sequences and their positions in the human genome, with each line as a separate sequence. I want to compare the positions of each sequence to a reference file and extract the name of the gene, which is the last column in the reference file. Both files are tab delimited and have a lot of extraneous information. The script goes line by line from the input and for each line searches criteria in the reference line by line. My script works, but as both of these files are large, this can mean runtime of several days. Is there any way to hash one of the files, so that it only iterates N times, vs N*M times, where N and M are line counts for each file? Any other advice would be greatly appreciated.

#!/usr/bin/perl
use strict;
use warnings;
use feature qw( say );

# Take input and output names from command line
my $in_qfn          = $ARGV[0];
my $out_qfn         = $ARGV[1];
my $transcripts_qfn = "/path/HumanTranscriptInfo";

my @transcripts;
{
    open(my $transcripts_fh, "<", $transcripts_qfn)
        or die("Can't open \"$transcripts_qfn\": $!\n");
    while (<$transcripts_fh>) {
        chomp;
        push @transcripts, [ split(/\t/, $_) ];
    }
}

{
    open(my $in_fh, "<", $in_qfn)
        or die("Can't open \"$in_qfn\": $!\n");
    open(my $out_fh, ">", $out_qfn)
        or die("Can't create \"$out_qfn\": $!\n");
    while (<$in_fh>) {
        chomp;

        # Define what each column is to print later
        my ($ID, $strand, $chr, $start, $sequence, $quality,
            $numPositions, $mismatches) = split(/\t/, $_);

        # Define an end position based on sequence length
        my $end = $start + length($sequence);

        # Define useful info from reference file
        for my $transcript (@transcripts) {
            my $ref_chr    = $transcript->[0];
            my $ref_strand = $transcript->[6];
            my $ref_start  = $transcript->[3];
            my $ref_end    = $transcript->[4];
            my $info       = $transcript->[8];

            # Now compare values; if strand and chromosome match and the
            # end falls within the reference range, print values
            if (   $chr eq $ref_chr
                && $strand eq $ref_strand
                && $end >= $ref_start
                && $end <= $ref_end )
            {
                say $out_fh join("\t", $ID, $strand, $chr, $start, $end,
                    $sequence, $numPositions, $mismatches, $info);
            }
        }
    }
}

Replies are listed 'Best First'.
Re: How to make a hash to evaluate columns between large datasets
by Eily (Monsignor) on Aug 23, 2018 at 12:49 UTC

    Hi rambosauce and welcome to the monastery.

    I'm a perl novice
    I wouldn't have guessed that; your code is quite well written.

    To answer your question, here is something you can do:

    my %transcripts;
    {
        open(my $transcripts_fh, "<", $transcripts_qfn)
            or die("Can't open \"$transcripts_qfn\": $!\n");
        while (<$transcripts_fh>) {
            chomp;
            my @refs = split(/\t/, $_);
            my ($ref_chr, $ref_strand) = @refs[0, 6];
            $transcripts{$ref_chr}{$ref_strand}
                = { start => $refs[3], end => $refs[4], info => $refs[8] };
        }
    }
    (Edit: untested for lack of input data)
    Now when you are reading the second file, rather than going through all transcripts, you can directly obtain
    my $transcript = $transcripts{$chr}{$strand};
    my $start = $transcript->{start};
    my $end   = $transcript->{end};
    my $info  = $transcript->{info};
    That's assuming that $ref_chr and $ref_strand are a unique pair. If you can have several start/end/info values for a given chr-strand pair, you'll have to use an intermediate array (I didn't want to give the more complex solution if the simple one is enough).

    FYI, for debugging you can easily have something like:

    use Data::Dump "pp";
    ...
    say pp \%transcripts; # debug the content of %transcripts
    or
    use Data::Dumper;
    ...
    say Dumper \%transcripts;
    The first looks nicer, but Data::Dumper doesn't require an installation.

      Hi Eily, thank you for the greeting, the compliment, and the debugging suggestions. I already like how you have defined the scalars, it is much cleaner and easier to read than my original. Unfortunately I will have to create an intermediate array as I will have several ref start/end/info values for a given chr/strand pair. I posted a sample of the reference file below with unnecessary information dotted out:

      1	.	.	14404	14501	.	-	.	Name=DDX11
      1	.	.	15005	15038	.	-	.	Name=ACTB

      Cheers!

        Ok then you can do it this way:

        my %transcripts;
        {
            open(my $transcripts_fh, "<", $transcripts_qfn)
                or die("Can't open \"$transcripts_qfn\": $!\n");
            while (<$transcripts_fh>) {
                chomp;
                my @refs = split(/\t/, $_);
                my ($ref_chr, $ref_strand) = @refs[0, 6];
                my $values = { start => $refs[3], end => $refs[4], info => $refs[8] };
                push @{ $transcripts{$ref_chr}{$ref_strand} }, $values;
            }
        }
        # You should really debug the output at least once with Data::Dumper
        # to see how it looks
        As you can see, this one is trickier. @{ something } uses "something" as an array ref, and since in this case "something" is a subelement of a structure, perl will create the array for you if it doesn't exist (this is autovivification).
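A tiny standalone illustration of that autovivification (toy data made up for the demo, not from the thread):

```perl
use strict;
use warnings;
use feature 'say';

my %transcripts;    # starts completely empty

# Pushing through @{ ... } makes Perl create the missing hash levels
# and the array reference on the fly (autovivification).
push @{ $transcripts{'1'}{'-'} }, { start => 14404, end => 14501, info => 'Name=DDX11' };
push @{ $transcripts{'1'}{'-'} }, { start => 15005, end => 15038, info => 'Name=ACTB'  };

say scalar @{ $transcripts{'1'}{'-'} };    # 2 entries under the same chr/strand pair
```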

        Now you use that like this:

        my $transcripts_array = $transcripts{$chr}{$strand}; # might need a better name
        for my $transcript (@$transcripts_array) {
            my $start = $transcript->{start};
            ...
        }
        Can a given line in the second file match several transcripts, or can you stop looking when you have a match?

Re: How to make a hash to evaluate columns between large datasets
by TheloniusMonk (Sexton) on Aug 23, 2018 at 12:08 UTC
    Hello and welcome. Can you please give an indication of how many keys would need to be available from the reference file?
      Hi Thelonius. I would need to compare four keys in the reference, and if those four match, return only an additional key which is the final column.
        In this case you just need to put the information from the reference file in a hash and use that for lookups instead of an array. A hash has the syntax
        my %hashname = (); # init empty hash
        $hashname{key1} = "value1";
        $hashname{key2} = { x => 'y' }; # nests an anonymous hash as the value for key2
        Often you want to store a hash reference in a scalar so you can pass it around without copying the entire hash, e.g.:
        my $href = \%hash;
        hope this helps
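Since four keys have to match, one option besides nesting is to fold them into a single composite key. A sketch with made-up toy rows (the separator and column choice are illustrative, not from the thread):

```perl
use strict;
use warnings;
use feature 'say';

my %info_for;    # composite key "chr|strand" => gene info

# Toy reference rows: chromosome, strand, info (real files have more columns)
for my $row ( [ '1', '-', 'Name=DDX11' ], [ '2', '+', 'Name=ACTB' ] ) {
    my ( $chr, $strand, $info ) = @$row;
    $info_for{"$chr|$strand"} = $info;    # join the keys with a separator
}

# Lookup is then a single hash access instead of a full scan
say $info_for{'1|-'} // 'no match';    # Name=DDX11
```

The separator just has to be a character that cannot occur inside the key fields themselves.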
Re: How to make a hash to evaluate columns between large datasets
by rozcovo (Initiate) on Aug 23, 2018 at 15:16 UTC

    Hello Friend,

    Just like you, I'm a beginner in this language,

    apparently this is a computationally demanding task that could be optimized with the use of threads. I will try to do an implementation like that, but please post a long sample of your input file.

      Hi rozcovo,

      IMHO, this is not a computationally difficult task. It really boils down to first loading the reference data into a hash and then reading the single input file, looking each line up in the hash. Quite simple. And since there is apparently only one data input file, I doubt that using threads will bring any performance benefit.
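That two-phase shape can be sketched with toy in-memory data (column layout borrowed from the reference sample posted in the thread; untested against the real files):

```perl
use strict;
use warnings;
use feature 'say';

# Phase 1: load the reference once into a hash of arrays,
# keyed by chromosome (column 0) and strand (column 6).
my @reference = (
    [ '1', '.', '.', 14404, 14501, '.', '-', '.', 'Name=DDX11' ],
    [ '1', '.', '.', 15005, 15038, '.', '-', '.', 'Name=ACTB'  ],
);
my %transcripts;
for my $refs (@reference) {
    push @{ $transcripts{ $refs->[0] }{ $refs->[6] } },
        { start => $refs->[3], end => $refs->[4], info => $refs->[8] };
}

# Phase 2: stream the input; each line now only scans the
# (usually small) bucket for its own chr/strand pair.
my ( $chr, $strand, $end ) = ( '1', '-', 14450 );    # one toy input record
for my $t ( @{ $transcripts{$chr}{$strand} // [] } ) {
    say $t->{info} if $end >= $t->{start} && $end <= $t->{end};
}
# prints: Name=DDX11
```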

      Here is a head from my input file, and the columns are the following information: ID, strand, chromosome, start, sequence, quality score, and positions in the genome. The last two are unnecessary for what I need, so the script is only defining strand, chromosome, start, and length of sequence to find the end. I use these to then parse through the reference file to grab the info in the last column of the reference, and append most of the info from the original input.

      I chose this sample as it has some of the info I hope the script to cope with, such as a chromosome missing in the reference (line 1) and different sites on the same chromosome (lines 3 and 4).

      3-51568 + HSV1_17 9285 TGGGCAAACACTTGGGGACTG IIIIIIIIII +IIIIIIIIIII 0 2-70337 + KI270733.1 135235 TCGCTGCGATCTATTGAAAGTCAGCCCTCG +ACACAAGGGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4 + 2-70337 + 21 8446166 TCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAG +GGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4 2-70337 + 21 8218896 TCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAG +GGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4 2-70337 + GL000220.1 118372 TCGCTGCGATCTATTGAAAGTCAGCCCTCG +ACACAAGGGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4 + 2-70337 + 21 8401935 TCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAG +GGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4 1-130983 + 2 32916254 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG +GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIII +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 5 1-130983 + 2 32916255 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG +GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIII +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 5 1-130983 + 2 32916256 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG +GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIII +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 5 1-130983 + 2 32916257 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG +GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIII +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 5