I'd like to create a Perl program that accepts as input one file of genome loci (potentially tens of GB in size) and one region file (typically around 1 GB), and I'm trying to work out the best approach to filtering the loci file based on the coordinates in the region file.

The loci file looks like this (the lines are truncated here; the important columns are the first two, chromosome and position, although I'd keep the whole line for each match):

#CHROM POS ID REF ALT QUAL FILTER IN
1 874816 . C CT 10 FINAL_AMBG
1 1647893 . C CTTTCTT 30 FINAL_NOT_
1 7889972 rs57875989 GAGAATCCATCCCATCCTACTGCCAG
1 14106394 . A ACTC 100 FI
1 22332156 . AGG A 10 FI
1 22332161 . T TC 0 FI

The region file looks like this:

1 69091 69290
1 69291 69490
1 69491 69690
1 69691 70008
1 861321 861393
1 865535 865716
1 866418 866469

The first column is the chromosome, the second column is the start coordinate, and the third column is the end coordinate.

What I'd like to do is keep only the lines from the first (loci) file whose 'POS' falls between the start and end coordinates of a region on the same chromosome in the second (region) file.

A nested while loop would seem to be grossly inefficient, so I was thinking of building a hash of arrays. But before I do, I'd like to draw on the wisdom of the Monks on whether there's a better approach. Execution speed and memory efficiency are paramount. Suggestions much appreciated.
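
To make the hash-of-arrays idea concrete, here is a rough sketch of what I have in mind, not a tested solution. It assumes whitespace-separated columns, inclusive start and end coordinates, and regions that do not overlap within a chromosome (so a binary search over the sorted starts is valid). The sub names load_regions, in_region and filter_loci are just placeholders, and passing '#' header lines through unchanged is my own choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read the region file into a hash of arrays keyed by chromosome.
# Each element is [start, end]. Assumes whitespace-separated columns.
sub load_regions {
    my ($fh) = @_;
    my %regions;
    while (my $line = <$fh>) {
        my ($chr, $start, $end) = split ' ', $line;
        next unless defined $end;    # skip blank or short lines
        push @{ $regions{$chr} }, [ $start, $end ];
    }
    # Sort each chromosome's regions by start so they can be binary searched.
    # Assumption: regions on one chromosome do not overlap.
    for my $chr (keys %regions) {
        @{ $regions{$chr} } = sort { $a->[0] <=> $b->[0] } @{ $regions{$chr} };
    }
    return \%regions;
}

# Binary search: returns 1 if $pos lies inside some region on $chr
# (bounds treated as inclusive), 0 otherwise.
sub in_region {
    my ($regions, $chr, $pos) = @_;
    my $list = $regions->{$chr} or return 0;
    my ($lo, $hi) = (0, $#$list);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if    ($pos < $list->[$mid][0]) { $hi = $mid - 1 }
        elsif ($pos > $list->[$mid][1]) { $lo = $mid + 1 }
        else                            { return 1 }
    }
    return 0;
}

# Stream the loci file one line at a time, so it never sits in memory.
# Header lines starting with '#' are copied through unchanged.
sub filter_loci {
    my ($regions, $in, $out) = @_;
    while (my $line = <$in>) {
        if ($line =~ /^#/) { print {$out} $line; next }
        my ($chr, $pos) = split ' ', $line, 3;
        print {$out} $line
            if defined $pos && in_region($regions, $chr, $pos);
    }
    return;
}

# Run only when both file names are given: filter.pl regions.txt loci.txt
if (@ARGV == 2) {
    my ($region_file, $loci_file) = @ARGV;
    open my $rfh, '<', $region_file or die "Cannot open $region_file: $!";
    my $regions = load_regions($rfh);
    close $rfh;
    open my $lfh, '<', $loci_file or die "Cannot open $loci_file: $!";
    filter_loci($regions, $lfh, \*STDOUT);
    close $lfh;
}
```

This holds only the region file in memory and reads the loci file as a stream, at a cost of one binary search per locus. My worry is the memory side: one small anonymous array per region adds up quickly for a 1 GB region file, so something more compact (two plain arrays of starts and ends per chromosome, or packed strings) may turn out to be necessary.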


In reply to Best approach for large-scale data processing by iangibson
