Re: How to make a hash to evaluate columns between large datasets

Hi rambosauce and welcome to the monastery.

I'm a perl novice

I wouldn't have guessed that; your code is quite well written.

To answer your question, here is something you can do:

my %transcripts;
{
   open(my $transcripts_fh, "<", $transcripts_qfn)
      or die("Can't open \"$transcripts_qfn\": $!\n");
   while (<$transcripts_fh>) {
      chomp;
      my @refs = split(/\t/, $_);
      my ($ref_chr, $ref_strand) = @refs[0, 6];
      $transcripts{$ref_chr}{$ref_strand} = {start => $refs[3], end => $refs[4], info => $refs[8]};
   }
}

(Edit: untested for lack of input data)
Now when you are reading the second file, rather than going through all transcripts, you can directly obtain

my $transcript = $transcripts{$chr}{$strand};
my $start = $transcript->{start};
my $end   = $transcript->{end};
my $info  = $transcript->{info};

That's assuming that $ref_chr and $ref_strand are a unique pair. If you can have several start/end/info values for a given chr-strand pair, you'll have to use an intermediate array (I didn't want to give the more complex solution if the simple one is enough).
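
To see why uniqueness matters, here is a minimal sketch using the two sample records rambosauce posts further down the thread. The `@records` layout (chr, start, end, strand, info) is only an illustration, not the real file format. With plain assignment, a second record for the same chr-strand pair silently replaces the first:

```perl
use strict;
use warnings;

# Two records sharing the same chromosome/strand pair, laid out as
# (chr, start, end, strand, info). This layout is illustrative only.
my @records = (
    [ '1', 14404, 14501, '-', 'Name=DDX11' ],
    [ '1', 15005, 15038, '-', 'Name=ACTB'  ],
);

my %transcripts;
for my $rec (@records) {
    my ($chr, $start, $end, $strand, $info) = @$rec;
    # Plain assignment: a later record with the same keys replaces the earlier one.
    $transcripts{$chr}{$strand} = { start => $start, end => $end, info => $info };
}

# Only the last record for the pair is left in the hash.
print "$transcripts{'1'}{'-'}{info}\n";   # prints "Name=ACTB"
```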

FYI, for debugging you can use either of the following:

use feature "say";
use Data::Dump "pp";
...
say pp \%transcripts; # debug the content of %transcripts

use feature "say";
use Data::Dumper;
...
say Dumper \%transcripts;

The first produces nicer output, but Data::Dumper ships with core Perl, so it needs no installation.
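
If you stay with Data::Dumper, a couple of its package variables make the dump easier to read. The sketch below uses a hand-written hash shaped like %transcripts; the settings are optional, and it uses print rather than say so it runs without enabling any feature:

```perl
use strict;
use warnings;
use Data::Dumper;

# Hand-written structure in the same shape as %transcripts (illustrative only).
my %transcripts = (
    '1' => { '-' => { start => 14404, end => 14501, info => 'Name=DDX11' } },
);

# Optional settings: sort hash keys so successive dumps are comparable,
# and use a more compact indentation style.
local $Data::Dumper::Sortkeys = 1;
local $Data::Dumper::Indent   = 1;

my $dump = Dumper(\%transcripts);
print $dump;
```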


Re^2: How to make a hash to evaluate columns between large datasets by rambosauce (Novice) on Aug 23, 2018 at 13:08 UTC
Hi Eily, thank you for the greeting, the compliment, and the debugging suggestions. I already like how you have defined the scalars; it is much cleaner and easier to read than my original. Unfortunately I will have to create an intermediate array, as I will have several ref start/end/info values for a given chr/strand pair. I posted a sample of the reference file below with unnecessary information dotted out:

1 . . 14404 14501 . - . Name=DDX11
1 . . 15005 15038 . - . Name=ACTB

Cheers!
Re^3: How to make a hash to evaluate columns between large datasets by Eily (Monsignor) on Aug 23, 2018 at 13:15 UTC
Ok then you can do it this way:

my %transcripts;
{
   open(my $transcripts_fh, "<", $transcripts_qfn)
      or die("Can't open \"$transcripts_qfn\": $!\n");
   while (<$transcripts_fh>) {
      chomp;
      my @refs = split(/\t/, $_);
      my ($ref_chr, $ref_strand) = @refs[0, 6];
      my $values = {start => $refs[3], end => $refs[4], info => $refs[8]};
      push @{ $transcripts{$ref_chr}{$ref_strand} }, $values;
   }
}
# You should really debug the output at least once with Data::Dumper to see what it looks like

As you can see, this one is trickier. @{ something } uses "something" as an array ref, and since in this case "something" is a subelement of a structure, perl will create the array for you if it doesn't exist (this is autovivification). Now you use that like this:

my $transcripts_array = $transcripts{$chr}{$strand}; # might need a better name
for my $transcript (@$transcripts_array) {
   my $start = $transcript->{start};
   ...
}

Can a given line in the second file match several transcripts, or can you stop looking when you have a match?
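
For a version that runs without an input file, here is a small sketch of the same push-with-autovivification pattern. It rebuilds the two sample reference lines from the post above as tab-separated strings; the `@lines` array stands in for the filehandle and is only an illustration:

```perl
use strict;
use warnings;

# The two sample reference lines, rebuilt as tab-separated records
# (the dots stand for columns that are not used here).
my @lines = (
    join("\t", '1', '.', '.', 14404, 14501, '.', '-', '.', 'Name=DDX11'),
    join("\t", '1', '.', '.', 15005, 15038, '.', '-', '.', 'Name=ACTB'),
);

my %transcripts;
for my $line (@lines) {
    my @refs = split /\t/, $line;
    my ($ref_chr, $ref_strand) = @refs[0, 6];
    # No array exists yet for a new chr/strand pair; pushing through the
    # dereference makes Perl create it (autovivification).
    push @{ $transcripts{$ref_chr}{$ref_strand} },
        { start => $refs[3], end => $refs[4], info => $refs[8] };
}

# Both records are kept, in file order, under the same chr/strand pair.
my $count = scalar @{ $transcripts{'1'}{'-'} };
print "transcripts on chr 1, strand -: $count\n";
for my $transcript (@{ $transcripts{'1'}{'-'} }) {
    printf "%d-%d %s\n", $transcript->{start}, $transcript->{end}, $transcript->{info};
}
```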
Re^4: How to make a hash to evaluate columns between large datasets by rambosauce (Novice) on Aug 23, 2018 at 21:17 UTC
This is great! Thanks a ton! As a test, I ran my old script against the improved one using your method on just 10 lines, on my not-so-high-end work computer:

Elapsed time with your script: 00:00:00.959781
Elapsed time with my original: 00:00:02.324184

Multiply this difference by a few hundred thousand for the complete input files, and you can really see the improvement.
Re^5: How to make a hash to evaluate columns between large datasets by FreeBeerReekingMonk (Deacon) on Aug 25, 2018 at 00:30 UTC
Re^4: How to make a hash to evaluate columns between large datasets by rambosauce (Novice) on Aug 23, 2018 at 13:50 UTC
Great Eily, thanks. I will test this. But in the meantime just to answer the question, it is possible that an entry in the input file could map to multiple transcripts.
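
Since one entry can map to several transcripts, the lookup has to collect every hit instead of stopping at the first. Below is a rough sketch of one way to do that. The helper name `matching_transcripts`, the made-up coordinates, and the rule that a match means a position falling inside start..end are all assumptions for illustration; the thread does not say how entries are actually compared.

```perl
use strict;
use warnings;

# Made-up lookup structure in the shape built earlier in the thread.
my %transcripts = (
    '1' => {
        '-' => [
            { start => 14404, end => 14501, info => 'Name=geneA' },
            { start => 14450, end => 15038, info => 'Name=geneB' },
        ],
    },
);

# Hypothetical helper: return every transcript on ($chr, $strand) whose
# start..end range contains $pos. The containment test is an assumption.
sub matching_transcripts {
    my ($transcripts, $chr, $strand, $pos) = @_;
    # Check each level with exists so that a lookup for an unknown
    # chromosome does not autovivify an empty entry.
    return () unless exists $transcripts->{$chr}
                  && exists $transcripts->{$chr}{$strand};
    return grep { $_->{start} <= $pos && $pos <= $_->{end} }
           @{ $transcripts->{$chr}{$strand} };
}

my @hits = matching_transcripts(\%transcripts, '1', '-', 14475);
print scalar(@hits), " match(es)\n";

# An unknown chromosome yields an empty list and leaves the hash untouched.
my @none = matching_transcripts(\%transcripts, '2', '+', 14475);
```

If a single match were enough, the grep could be swapped for a loop that exits with last on the first hit.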