rambosauce has asked for the wisdom of the Perl Monks concerning the following question:

Hello and thanks in advance

I'm a perl novice and have some trouble with the following script being inefficient. I have a data input file containing sequences and their positions in the human genome, with each line as a separate sequence. I want to compare the positions of each sequence to a reference file and extract the name of the gene, which is the last column in the reference file. Both files are tab delimited and have a lot of extraneous information. The script goes line by line from the input and for each line searches criteria in the reference line by line. My script works, but as both of these files are large, this can mean runtime of several days. Is there any way to hash one of the files, so that it only iterates N times, vs N*M times, where N and M are line counts for each file? Any other advice would be greatly appreciated.

#!/usr/bin/perl
use strict;
use warnings;
use feature qw( say );

# Take input and output names from command line
my $in_qfn          = $ARGV[0];
my $out_qfn         = $ARGV[1];
my $transcripts_qfn = "/path/HumanTranscriptInfo";

my @transcripts;
{
    open(my $transcripts_fh, "<", $transcripts_qfn)
        or die("Can't open \"$transcripts_qfn\": $!\n");
    while (<$transcripts_fh>) {
        chomp;
        push @transcripts, [ split(/\t/, $_) ];
    }
}

{
    open(my $in_fh, "<", $in_qfn)
        or die("Can't open \"$in_qfn\": $!\n");
    open(my $out_fh, ">", $out_qfn)
        or die("Can't create \"$out_qfn\": $!\n");
    while (<$in_fh>) {
        chomp;

        # Define what each column is to print later
        my ($ID, $strand, $chr, $start, $sequence, $quality,
            $numPositions, $mismatches) = split(/\t/, $_);

        # Define an end position based on sequence length
        my $end = $start + length($sequence);

        # Define useful info from reference file
        for my $transcript (@transcripts) {
            my $ref_chr    = $transcript->[0];
            my $ref_strand = $transcript->[6];
            my $ref_start  = $transcript->[3];
            my $ref_end    = $transcript->[4];
            my $info       = $transcript->[8];

            # Now compare values; if strand and chromosome match and the
            # end falls within the reference range, print values
            if (   $chr eq $ref_chr
                && $strand eq $ref_strand
                && $end >= $ref_start
                && $end <= $ref_end )
            {
                say $out_fh join("\t", $ID, $strand, $chr, $start, $end,
                    $sequence, $numPositions, $mismatches, $info);
            }
        }
    }
}

Replies are listed 'Best First'.
Re: How to make a hash to evaluate columns between large datasets
by Eily (Monsignor) on Aug 23, 2018 at 12:49 UTC

    Hi rambosauce and welcome to the monastery.

    I'm a perl novice
    I wouldn't have guessed that; your code is quite well written.

    To answer your question, here is something you can do:

    my %transcripts;
    {
        open(my $transcripts_fh, "<", $transcripts_qfn)
            or die("Can't open \"$transcripts_qfn\": $!\n");
        while (<$transcripts_fh>) {
            chomp;
            my @refs = split(/\t/, $_);
            my ($ref_chr, $ref_strand) = @refs[0, 6];
            $transcripts{$ref_chr}{$ref_strand}
                = { start => $refs[3], end => $refs[4], info => $refs[8] };
        }
    }
    (Edit: untested for lack of input data)
    Now when you are reading the second file, rather than going through all transcripts, you can directly obtain
    my $transcript = $transcripts{$chr}{$strand};
    my $start = $transcript->{start};
    my $end   = $transcript->{end};
    my $info  = $transcript->{info};
    That's assuming that $ref_chr and $ref_strand are a unique pair. If you can have several start/end/info values for a given chr-strand pair, you'll have to use an intermediate array (I didn't want to give the more complex solution if the simple one is enough).

    FYI, for debugging you can easily have something like:

    use Data::Dump "pp";
    ...
    say pp \%transcripts; # debug the content of %transcripts
    or
    use Data::Dumper;
    ...
    say Dumper \%transcripts;
    The first looks nicer, but Data::Dumper doesn't require an installation.

      Hi Eily, thank you for the greeting, the compliment, and the debugging suggestions. I already like how you have defined the scalars, it is much cleaner and easier to read than my original. Unfortunately I will have to create an intermediate array as I will have several ref start/end/info values for a given chr/strand pair. I posted a sample of the reference file below with unnecessary information dotted out:

      1	.	.	14404	14501	.	-	.	Name=DDX11
      1	.	.	15005	15038	.	-	.	Name=ACTB

      Cheers!

        Ok then you can do it this way:

        my %transcripts;
        {
            open(my $transcripts_fh, "<", $transcripts_qfn)
                or die("Can't open \"$transcripts_qfn\": $!\n");
            while (<$transcripts_fh>) {
                chomp;
                my @refs = split(/\t/, $_);
                my ($ref_chr, $ref_strand) = @refs[0, 6];
                my $values = { start => $refs[3], end => $refs[4], info => $refs[8] };
                push @{ $transcripts{$ref_chr}{$ref_strand} }, $values;
            }
        }
        # You should really debug the output at least once with Data::Dumper
        # to see how it looks
        As you can see, this one is trickier. @{ something } uses "something" as an array ref, and since in this case "something" is a subelement of a structure, perl will create the array for you if it doesn't exist (this is autovivification).
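A tiny standalone illustration of that autovivification (toy data made up for the demo, not from the thread):

```perl
use strict;
use warnings;
use feature 'say';

my %transcripts;    # starts completely empty

# Pushing through @{ ... } makes Perl create the missing hash levels
# and the array reference on the fly (autovivification).
push @{ $transcripts{'1'}{'-'} }, { start => 14404, end => 14501, info => 'Name=DDX11' };
push @{ $transcripts{'1'}{'-'} }, { start => 15005, end => 15038, info => 'Name=ACTB'  };

say scalar @{ $transcripts{'1'}{'-'} };    # 2 entries under the same chr/strand pair
```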

        Now you use that like this:

        my $transcripts_array = $transcripts{$chr}{$strand}; # might need a better name
        for my $transcript (@$transcripts_array) {
            my $start = $transcript->{start};
            ...
        }
        Can a given line in the second file match several transcripts, or can you stop looking when you have a match?

Re: How to make a hash to evaluate columns between large datasets
by TheloniusMonk (Sexton) on Aug 23, 2018 at 12:08 UTC
    Hello and welcome. Can you please give an indication of how many keys would need to be available from the reference file?
      Hi Thelonius. I would need to compare four keys in the reference, and if those four match, return only an additional key which is the final column.
        In this case you just need to put the information from the reference file in a hash and use that for lookups instead of an array. A hash has the syntax
        my %hashname = (); # init empty hash
        $hashname{key1} = "value1";
        $hashname{key2} = { x => 'y' }; # nests an anonymous hash as the value for key2
        Often you want to store a hash reference in a scalar so you can pass it around without copying the entire hash, e.g.:
        my $href = \%hash;
        hope this helps
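Since four keys have to match, one option besides nesting is to fold them into a single composite key. A sketch with made-up toy rows (the separator and column choice are illustrative, not from the thread):

```perl
use strict;
use warnings;
use feature 'say';

my %info_for;    # composite key "chr|strand" => gene info

# Toy reference rows: chromosome, strand, info (real files have more columns)
for my $row ( [ '1', '-', 'Name=DDX11' ], [ '2', '+', 'Name=ACTB' ] ) {
    my ( $chr, $strand, $info ) = @$row;
    $info_for{"$chr|$strand"} = $info;    # join the keys with a separator
}

# Lookup is then a single hash access instead of a full scan
say $info_for{'1|-'} // 'no match';    # Name=DDX11
```

The separator just has to be a character that cannot occur inside the key fields themselves.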
Re: How to make a hash to evaluate columns between large datasets
by rozcovo (Initiate) on Aug 23, 2018 at 15:16 UTC

    Hello Friend,

    Just like you, I'm a beginner in this language,

    apparently this is a computationally demanding task that could be optimized with the use of threads. I will try to do an implementation like that, but please post a long sample of your input file.

      Hi rozcovo,

      IMHO, this is not a computationally difficult task. It really boils down to first loading the reference data into a hash and then reading the single input file, looking each line up in the hash. Quite simple. And since there is apparently only one data input file, I doubt that using threads will bring any performance benefit.
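That two-phase shape can be sketched with toy in-memory data (column layout borrowed from the reference sample posted in the thread; untested against the real files):

```perl
use strict;
use warnings;
use feature 'say';

# Phase 1: load the reference once into a hash of arrays,
# keyed by chromosome (column 0) and strand (column 6).
my @reference = (
    [ '1', '.', '.', 14404, 14501, '.', '-', '.', 'Name=DDX11' ],
    [ '1', '.', '.', 15005, 15038, '.', '-', '.', 'Name=ACTB'  ],
);
my %transcripts;
for my $refs (@reference) {
    push @{ $transcripts{ $refs->[0] }{ $refs->[6] } },
        { start => $refs->[3], end => $refs->[4], info => $refs->[8] };
}

# Phase 2: stream the input; each line now only scans the
# (usually small) bucket for its own chr/strand pair.
my ( $chr, $strand, $end ) = ( '1', '-', 14450 );    # one toy input record
for my $t ( @{ $transcripts{$chr}{$strand} // [] } ) {
    say $t->{info} if $end >= $t->{start} && $end <= $t->{end};
}
# prints: Name=DDX11
```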

      Here is a head from my input file, and the columns are the following information: ID, strand, chromosome, start, sequence, quality score, and positions in the genome. The last two are unnecessary for what I need, so the script is only defining strand, chromosome, start, and length of sequence to find the end. I use these to then parse through the reference file to grab the info in the last column of the reference, and append most of the info from the original input.

      I chose this sample as it has some of the info I hope the script to cope with, such as a chromosome missing in the reference (line 1) and different sites on the same chromosome (lines 3 and 4).

      3-51568 + HSV1_17 9285 TGGGCAAACACTTGGGGACTG IIIIIIIIII +IIIIIIIIIII 0 2-70337 + KI270733.1 135235 TCGCTGCGATCTATTGAAAGTCAGCCCTCG +ACACAAGGGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4 + 2-70337 + 21 8446166 TCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAG +GGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4 2-70337 + 21 8218896 TCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAG +GGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4 2-70337 + GL000220.1 118372 TCGCTGCGATCTATTGAAAGTCAGCCCTCG +ACACAAGGGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4 + 2-70337 + 21 8401935 TCGCTGCGATCTATTGAAAGTCAGCCCTCGACACAAG +GGTTTGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 4 1-130983 + 2 32916254 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG +GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIII +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 5 1-130983 + 2 32916255 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG +GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIII +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 5 1-130983 + 2 32916256 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG +GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIII +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 5 1-130983 + 2 32916257 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG +GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG IIIIIIIIIIIIIIIIIIIIIIIII +IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 5