in reply to How to make a hash to evaluate columns between large datasets

Hello Friend,

Just like you, I'm a beginner in this language.

Apparently this is a computationally difficult task that could be optimized with threads. I will try to write an implementation along those lines, but could you please post a longer sample of your input file?

Re^2: How to make a hash to evaluate columns between large datasets
by Laurent_R (Canon) on Aug 23, 2018 at 16:46 UTC
    Hi rozcovo,

    IMHO, this is not a computationally difficult task. It really boils down to first loading the reference data into a hash, then reading the single input file and looking each record up in the hash. Quite simple. And since there is apparently only one data input file, I doubt that using threads will bring any performance benefit.
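
A minimal sketch of the load-then-lookup idea, written in Python for illustration only (a dict plays the role of the hash). The reference layout (lookup key in the first column), the function names `load_reference` and `lookup`, and the choice of the third input column as the key are all assumptions, not something stated in the thread.

```python
from collections import defaultdict


def load_reference(lines):
    """Build a dict (hash) from the reference: key -> list of remaining fields.

    Assumption: the lookup key is the first whitespace-separated column.
    """
    ref = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        ref[fields[0]].append(fields[1:])
    return ref


def lookup(input_lines, ref, key_col=2):
    """Stream the input once; yield (fields, matches) for keys found in ref.

    Assumption: key_col=2 (the chromosome column of the input) is the key.
    """
    for line in input_lines:
        fields = line.split()
        if len(fields) <= key_col:
            continue  # malformed or short line
        matches = ref.get(fields[key_col])  # .get() does not create new keys
        if matches:
            yield fields, matches
```

Each dict lookup takes roughly constant time on average, so the total work is about one pass over the reference plus one pass over the input, which is why threads are unlikely to help much here.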

Re^2: How to make a hash to evaluate columns between large datasets
by rambosauce (Novice) on Aug 23, 2018 at 20:22 UTC

    Here is the head of my input file. The columns are: ID, strand, chromosome, start, sequence, quality score, and positions in the genome. The last two are unnecessary for what I need, so the script only uses strand, chromosome, start, and the length of the sequence (to find the end). I then use these to search the reference file, grab the info in its last column, and append most of the info from the original input.

    I chose this sample because it includes some of the cases I hope to handle, such as a chromosome missing from the reference (line 1) and different sites on the same chromosome (lines 3 and 4).

    3-51568 + HSV1_17 9285 TGGGCAAACACTTGGGGACTG IIIIIIIIIIIIIIIIIIIII 0
    2-70337 + KI270733.1 135235 TCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAGGGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4
    2-70337 + 21 8446166 TCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAGGGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4
    2-70337 + 21 8218896 TCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAGGGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4
    2-70337 + GL000220.1 118372 TCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAGGGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4
    2-70337 + 21 8401935 TCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAGGGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4
    1-130983 + 2 32916254 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 5
    1-130983 + 2 32916255 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 5
    1-130983 + 2 32916256 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 5
    1-130983 + 2 32916257 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 5
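
One possible way to turn each record into strand, chromosome, start and end, and then pull the reference's last column, sketched in Python for illustration. Several things here are assumptions rather than facts from the post: the names `Read`, `parse_read` and `annotate`; the reference being held as a dict of chromosome -> list of `(start, end, annotation)` tuples; the end being computed as `start + len(sequence) - 1` (inclusive coordinates); and "matching" meaning that the read overlaps a reference interval.

```python
from typing import NamedTuple, Optional


class Read(NamedTuple):
    read_id: str
    strand: str
    chrom: str
    start: int
    end: int  # assumed inclusive: start + len(sequence) - 1


def parse_read(line: str) -> Optional[Read]:
    """Parse one whitespace-separated record; quality and positions are ignored."""
    fields = line.split()
    if len(fields) < 5:
        return None  # not enough columns to find the sequence
    read_id, strand, chrom, start, seq = fields[:5]
    start = int(start)
    return Read(read_id, strand, chrom, start, start + len(seq) - 1)


def annotate(read: Read, ref: dict) -> list:
    """Return annotations of reference intervals that overlap the read.

    ref is a hypothetical dict: chrom -> list of (start, end, annotation).
    A chromosome absent from ref (like HSV1_17 above) yields an empty list.
    """
    return [ann for (s, e, ann) in ref.get(read.chrom, ())
            if s <= read.end and read.start <= e]
```

Because the dict is keyed by chromosome, several sites on the same chromosome (lines 3 and 4 of the sample) each get their own lookup against the same list of intervals. If the coordinates are 0-based or half-open, the end calculation and the overlap test would need adjusting.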