in reply to Re: How to make a hash to evaluate columns between large datasets
in thread How to make a hash to evaluate columns between large datasets

Hi Eily, thank you for the greeting, the compliment, and the debugging suggestions. I already like how you have defined the scalars; it is much cleaner and easier to read than my original. Unfortunately I will have to create an intermediate array, as I will have several ref start/end/info values for a given chr/strand pair. I posted a sample of the reference file below, with unnecessary information dotted out:

1 . . 14404 14501 . - . Name=DDX11
1 . . 15005 15038 . - . Name=ACTB

Cheers!


Re^3: How to make a hash to evaluate columns between large datasets
by Eily (Monsignor) on Aug 23, 2018 at 13:15 UTC

    OK, then you can do it this way:

    my %transcripts;
    {
        open(my $transcripts_fh, "<", $transcripts_qfn)
            or die("Can't open \"$transcripts_qfn\": $!\n");
        while (<$transcripts_fh>) {
            chomp;
            my @refs = split(/\t/, $_);
            my ($ref_chr, $ref_strand) = @refs[0, 6];
            my $values = { start => $refs[3], end => $refs[4], info => $refs[8] };
            # Several transcripts can share a chr/strand pair, so push each
            # one onto the array stored under that pair.
            push @{ $transcripts{$ref_chr}{$ref_strand} }, $values;
        }
    }
    # You should really debug the output at least once with Data::Dumper
    # to see what it looks like.
    As you can see, this one is trickier. @{ something } dereferences "something" as an array ref, and since in this case "something" is a subelement of a larger structure, Perl will create the array for you if it doesn't exist yet (this is autovivification).
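    A minimal standalone sketch of what autovivification does here, with a Data::Dumper call to inspect the result (the sample values are taken from the reference file above, purely for illustration):

    use strict;
    use warnings;
    use Data::Dumper;

    my %transcripts;    # starts completely empty

    # Pushing through two levels of keys that don't exist yet makes Perl
    # create the nested hash and the array ref on the fly (autovivification).
    push @{ $transcripts{1}{'-'} }, { start => 14404, end => 14501, info => 'Name=DDX11' };
    push @{ $transcripts{1}{'-'} }, { start => 15005, end => 15038, info => 'Name=ACTB' };

    # Prints roughly:
    # $VAR1 = { '1' => { '-' => [ { ...DDX11... }, { ...ACTB... } ] } };
    print Dumper(\%transcripts);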

    Now you use that like this:

    my $transcripts_array = $transcripts{$chr}{$strand}; # might need a better name
    for my $transcript (@$transcripts_array) {
        my $start = $transcript->{start};
        ...
    }
    Can a given line in the second file match several transcripts, or can you stop looking when you have a match?
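    A minimal sketch of the full lookup loop either way, assuming the current line of the second file supplies $chr, $strand and $end (names borrowed from the other snippets in this thread) and reusing the overlap test from the flock example further down:

    # The "// []" fallback prevents a "Can't use an undefined value as an
    # ARRAY reference" error when a chr/strand pair has no transcripts.
    for my $transcript (@{ $transcripts{$chr}{$strand} // [] }) {
        next if $end < $transcript->{start} || $end > $transcript->{end};
        # ... handle the match ...
        # last;   # uncomment only if a single match is enough
    }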

      This is great! Thanks a ton! As a test, I compared my old script against the improved version using your method, on just 10 lines, on my not-exactly-high-end work computer:

      Elapsed time with your script: 00:00:00.959781

      Elapsed time with my original: 00:00:02.324184

      Multiply this difference by a few hundred thousand lines for the complete input files, and you can really see the improvement.
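      For a more rigorous comparison than wall-clock timing, Perl's core Benchmark module can race the two approaches directly. A minimal sketch, where old_lookup() and new_lookup() are hypothetical stand-ins for the two implementations run over the same test input:

      use strict;
      use warnings;
      use Benchmark qw(cmpthese);

      # Hypothetical stand-ins: in practice these would wrap the real
      # lookup code over the same 10-line test input.
      sub old_lookup { my @hits = grep { $_ % 7 == 0 } 1 .. 1000; }
      sub new_lookup { my %seen = map  { $_ => 1 }      1 .. 1000; }

      cmpthese(-2, {                  # run each sub for at least 2 CPU seconds
          original => \&old_lookup,
          hash     => \&new_lookup,
      });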

        If you are still implementing threads: don't forget to flock the file while writing to it inside the threads, something like this:

        use threads;
        use feature 'say';
        use Fcntl qw(:flock SEEK_END);

        if ($end >= $ref->{start} && $end <= $ref->{end}) {
            lock($out_fh);
            say $out_fh join("\t", $ID, $strand, $chr, $start, $end, $sequence,
                $numPositions, $mismatches || "", $ref->{info});
            unlock($out_fh);
        }

        # Defining our own lock() overrides Perl's built-in lock (a "weak"
        # keyword), so the calls above go to these subs.
        sub lock {
            my ($fh) = @_;
            flock($fh, LOCK_EX) or die $!;
            seek($fh, 0, SEEK_END) or die $!;
        }

        sub unlock {
            my ($fh) = @_;
            flock($fh, LOCK_UN) or die $!;
        }

        Alternatively, use shared objects: collect the output in a shared array or hash and write it out later.
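        A minimal sketch of that alternative with threads::shared; worker() and the placeholder line are illustrative, not from the code above:

        use strict;
        use warnings;
        use feature 'say';
        use threads;
        use threads::shared;

        my @results :shared;    # all threads push onto this one shared array

        sub worker {
            my ($id) = @_;
            # ... do the real matching work here ...
            my $line = join("\t", "thread$id", "placeholder");   # illustrative only
            push @results, $line;    # pushing one element at a time is safe
        }

        my @workers = map { threads->create(\&worker, $_) } 1 .. 4;
        $_->join for @workers;

        # Write everything from the main thread, so no flock is needed.
        open(my $out_fh, ">", "out.tsv") or die $!;
        say $out_fh $_ for @results;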

      Great, Eily, thanks. I will test this. But in the meantime, just to answer your question: yes, it is possible that an entry in the input file maps to multiple transcripts.