in reply to How to make a hash to evaluate columns between large datasets

Hi rambosauce and welcome to the monastery.

I'm a perl novice
I wouldn't have guessed that, your code is quite well written.

To answer your question, here is something you can do:

my %transcripts; { open(my $transcripts_fh, "<", $transcripts_qfn) or die("Can't open \"$transcripts_qfn\": $!\n"); while (<$transcripts_fh>) { chomp; my @refs = split(/\t/, $_); my ($ref_chr, $ref_strand) = @refs[0, 6]; $transcripts{$ref_chr}{$ref_strand} = {start => $refs[3], end => + $refs[4], info => $refs[8]}; } }
(Edit: untested for lack of input data)
Now when you are reading the second file, rather than going through all transcripts, you can directly obtain
my $transcript = $transcripts{$chr}{$strand}; my $start = $transcript->{start}; my $end = $transcript->{end}; my $info = $transcript->{info};
That's assuming that $ref_chr and $ref_strand are a unique pair. If you can have several start/end/info values for a given chr-strand pair, you'll have to use an intermediate array (I didn't want to give the more complex solution if the simple one is enough).

FYI, for debugging you can easer have something like:

use Data::Dump "pp"; ... say pp \%transcripts; # debug the content of %transcripts
or
use Data::Dumper; ... say Dumper \%transcripts;
The first looks nicer, but Data::Dumper doesn't require an installation.

Replies are listed 'Best First'.
Re^2: How to make a hash to evaluate columns between large datasets
by rambosauce (Novice) on Aug 23, 2018 at 13:08 UTC

    Hi Eily, thank you for the greeting, the compliment, and the debugging suggestions. I already like how you have defined the scalars, it is much cleaner and easier to read than my original. Unfortunately I will have to create an intermediate array as I will have several ref start/end/info values for a given chr/strand pair. I posted a sample of the reference file below with unnecessary information dotted out:

    1 . . 14404 14501 . - . Name=DDX11 1 . . 15005 15038 . - . Name=ACTB

    Cheers!

      Ok then you can do it this way:

      my %transcripts; { open(my $transcripts_fh, "<", $transcripts_qfn) or die("Can't open \"$transcripts_qfn\": $!\n"); while (<$transcripts_fh>) { chomp; my @refs = split(/\t/, $_); my ($ref_chr, $ref_strand) = @refs[0, 6]; my $values= {start => $refs[3], end => $refs[4], info => $refs[8 +]}; push @{ $transcripts{$ref_chr}{$ref_strand} }, $values; } } # You should really debug the output at least once with Data::Dumper t +o see how it looks like
      As you can see, this one is trickier. @{ something } uses "something" as an array ref, and since in this case "something" is a subelement of a structure, perl will create the array for you if it doesn't exist (this is autovivification).

      Now you use that like this:

      my $transcripts_array = $transcripts{$chr}{$strand}; # might need a be +tter name for my $transcript (@$transcripts_array) { my $start = $transcript->{start}; ... }
      Can a given line in the second file match several transcripts, or can you stop looking when you have a match?

        This is great! Thanks a ton! As a test I tried my old script vs the better one with your method on just 10 lines on my not high-end work computer:

        Elapsed time with your script: 00:00:00.959781

        Elapsed time with my original: 00:00:02.324184

        Multiply this difference by a few hundred thousand for the complete input files, and you can really note the improvement.

        Great Eily, thanks. I will test this. But in the meantime just to answer the question, it is possible that an entry in the input file could map to multiple transcripts.